A Scalable Big Data-Driven Distributed Deep Learning Framework for Breast Cancer Diagnosis Using Big Data Analytics
Received: 1 June 2025 | Revised: 9 July 2025 and 28 July 2025 | Accepted: 1 August 2025 | Online: 23 August 2025
Corresponding author: Sarah Kaleem
Abstract
The accurate and early detection of breast cancer remains a significant challenge in medical diagnostics, primarily due to the complexity of histopathological images and the large volume of data involved. This paper presents a novel hybrid deep learning framework that leverages Big Data Analytics (BDA) and Convolutional Neural Networks (CNNs) to enhance the accuracy of breast cancer detection. The proposed system integrates three robust deep learning architectures (VGG16, VGG19, and ResNet50) trained in parallel across distributed nodes using Apache Spark, thereby accelerating computation and enabling scalable learning. This study used the BreakHis dataset, which contains 15,918 original images collected at four magnifications. To improve generalization and class balance, extensive data augmentation and patch extraction were applied, expanding the dataset to approximately 275,000 training samples. The hybrid model performed strongly on classification tasks, with precision, recall, and F1-scores that compare favorably with existing benchmarks. Key performance indicators, such as accuracy, specificity, and sensitivity, confirm the effectiveness of the model in distinguishing between benign and malignant cases. Unlike traditional monolithic CNN approaches, the proposed system leverages distributed processing to reduce training time while efficiently handling massive datasets.
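The performance indicators named in the abstract (accuracy, sensitivity, specificity, precision, and F1-score) can all be derived from confusion-matrix counts. The following is a minimal sketch, not the authors' evaluation code: the function name `binary_metrics`, the label encoding (1 = malignant, 0 = benign), and the toy labels are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class BinaryMetrics:
    accuracy: float
    sensitivity: float  # also called recall or true-positive rate
    specificity: float  # true-negative rate
    precision: float
    f1: float


def _ratio(numerator: int, denominator: int) -> float:
    """Return numerator/denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int],
                   positive: int = 1) -> BinaryMetrics:
    """Compute metrics for a two-class problem (assumed: 1 = malignant)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1

    sensitivity = _ratio(tp, tp + fn)
    precision = _ratio(tp, tp + fp)
    return BinaryMetrics(
        accuracy=_ratio(tp + tn, tp + tn + fp + fn),
        sensitivity=sensitivity,
        specificity=_ratio(tn, tn + fp),
        precision=precision,
        f1=_ratio(2 * precision * sensitivity, precision + sensitivity)
        if (precision + sensitivity) else 0.0,
    )


if __name__ == "__main__":
    # Toy labels only; these are not results from the paper.
    truth = [1, 1, 1, 1, 0, 0, 0, 0]
    preds = [1, 1, 1, 0, 0, 0, 0, 1]
    print(binary_metrics(truth, preds))
```

Reporting sensitivity and specificity alongside accuracy matters for a diagnostic task because a classifier can reach high accuracy on an imbalanced dataset while still missing many malignant cases.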
Keywords:
big data, breast cancer, distributed learning, deep CNN
License
Copyright (c) 2025 Sarah Kaleem, Mohamed El-Affendi, Muhammad Babar, Zahid Khan

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain the copyright and grant the journal the right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) after its publication in ETASR with an acknowledgement of its initial publication in this journal.