From Raw to Ready: Industrial Fault Data Enhancement Via Preprocessing and Balancing

Authors

  • Suroor M. Albattat, College of Engineering, Al-Iraqi University, Saba’a Abkar Complex, Baghdad, Iraq
  • Baraa M. Albaker, College of Engineering, Al-Iraqi University, Saba’a Abkar Complex, Baghdad, Iraq
  • Malik A. Alsaedi, College of Engineering, Al-Iraqi University, Saba’a Abkar Complex, Baghdad, Iraq
Volume: 15 | Issue: 5 | Pages: 28313-28323 | October 2025 | https://doi.org/10.48084/etasr.12784

Abstract

In recent years, predictive maintenance has emerged as a critical component for improving the efficiency and reliability of industrial systems. However, much existing research emphasizes model development and often overlooks the role that data quality and class distribution play in predictive performance. To address this gap, this study proposes an integrated preprocessing framework that prepares the data at every stage, from cleaning to class balancing. A case study was conducted on an industrial sensor dataset for fault prediction. The preprocessing pipeline imputed missing values with K-Nearest Neighbors (KNN), detected outliers with Isolation Forest (IF), and corrected abnormal values with the clipping method. To address class imbalance, synthetic data were generated using Generative Adversarial Networks (GAN), Variational Autoencoders (VAE), and a hybrid GAN-VAE model that combines the strengths of both approaches. Of the three generators, the hybrid GAN-VAE produced the best synthetic data, yielding the highest Pearson correlation and the best Kernel Density Estimation (KDE) fit, which supports the reliability of the balanced dataset for training. The preprocessing framework was validated using a one-dimensional Convolutional Neural Network (1D-CNN) classifier, which achieved an accuracy of 98.83%.
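The cleaning stage described in the abstract (KNN imputation, Isolation Forest detection, clipping) can be illustrated with a minimal sketch using scikit-learn. This is not the authors' implementation: the function name `preprocess`, the toy data, the values `n_neighbors=5` and `contamination=0.05`, and the choice to clip each feature to the range observed among inlier rows are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.impute import KNNImputer


def preprocess(X, n_neighbors=5, contamination=0.05, random_state=0):
    """Impute missing values, flag outlier rows, and clip features.

    All parameter defaults are illustrative assumptions, not the
    settings reported in the paper.
    """
    X = np.asarray(X, dtype=float)

    # 1) Fill each NaN with the mean of that feature over the k nearest rows.
    X_imputed = KNNImputer(n_neighbors=n_neighbors).fit_transform(X)

    # 2) Isolation Forest labels each row: +1 = inlier, -1 = outlier.
    forest = IsolationForest(contamination=contamination,
                             random_state=random_state)
    is_outlier = forest.fit_predict(X_imputed) == -1

    # 3) Clip every feature to the min/max seen among inlier rows, so
    #    flagged rows are corrected rather than dropped (assumed bounds).
    inliers = X_imputed[~is_outlier]
    lower, upper = inliers.min(axis=0), inliers.max(axis=0)
    X_clean = np.clip(X_imputed, lower, upper)
    return X_clean, is_outlier


# Hypothetical sensor matrix: 200 readings of 3 features.
rng = np.random.default_rng(0)
X_raw = rng.normal(loc=50.0, scale=5.0, size=(200, 3))
X_raw[rng.integers(0, 200, size=10), 1] = np.nan  # simulated missing values
X_raw[:3, 0] = [500.0, -400.0, 450.0]             # simulated abnormal readings

X_clean, is_outlier = preprocess(X_raw)
print("rows flagged as outliers:", int(is_outlier.sum()))
print("missing values remaining:", int(np.isnan(X_clean).sum()))
```

Clipping keeps every row in the dataset, which matters when the fault class is already scarce; dropping flagged rows instead would be a different design choice with its own trade-offs.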

Keywords:

data preprocessing, imbalanced data, machine learning, outliers, Generative Adversarial Network (GAN)

How to Cite

[1]
S. M. Albattat, B. M. Albaker, and M. A. Alsaedi, “From Raw to Ready: Industrial Fault Data Enhancement Via Preprocessing and Balancing”, Eng. Technol. Appl. Sci. Res., vol. 15, no. 5, pp. 28313–28323, Oct. 2025.
