ALBERTIR: A BERT-Based Pretraining for Indonesian Religious Texts Using Qur'an and Hadith Translations

Authors

I. Darmawan, H. Elmunsyah, and D. D. Prasetya

Volume: 15 | Issue: 5 | Pages: 28307-28312 | October 2025 | https://doi.org/10.48084/etasr.12977

Abstract

This study introduces Al-Qur’an BERT for Indonesian Religious Texts (ALBERTIR), a domain-adaptive Bidirectional Encoder Representations from Transformers (BERT) model pretrained on Indonesian religious texts, including official Qur’an and Hadith translations. The corpus comprises over 1.2 million tokens sourced from verified government publications and optimized for Masked Language Modeling (MLM). ALBERTIR features weighted MLM, sacred term preservation, and factorized embeddings to enhance understanding of religious semantics and maintain doctrinal integrity. Training was conducted on Google Colab Pro with a TPU v3-8, where ALBERTIR outperformed BERT-base and A Lite BERT for Indonesian (ALBERT-ID), improving religious term prediction by 10.9% and reducing training time by more than 40%. Across downstream tasks such as religious Question Answering (QA), sentiment analysis, and text classification, it achieved up to 8% higher F1-scores. Ablation studies confirmed the effectiveness of its core components, demonstrating advantages in semantic accuracy, contextual sensitivity, and reliability in religious Natural Language Processing (NLP) applications. Unlike general-purpose models such as Indonesian BERT (IndoBERT) and multilingual BERT (mBERT), the proposed model is specifically optimized for theological language, thereby reducing vague or contextually inappropriate outputs. This makes it especially suitable for applications such as fatwa retrieval, Islamic education tools, and religious chatbot systems. Cross-lingual evaluations further showed that ALBERTIR surpasses mBERT by +13.3 Bilingual Evaluation Understudy (BLEU)-4 points in religious QA tasks, while maintaining competitive performance on general benchmarks. The ablation analysis also identified sacred term preservation as the most critical contributor to accuracy gains, underscoring the importance of domain-specific features. Overall, ALBERTIR demonstrates strong capabilities in capturing linguistic precision and theological nuance, establishing a robust foundation for future religious NLP research and applications.
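The abstract names weighted MLM and sacred term preservation but does not publish the implementation. The following Python sketch illustrates one plausible reading of those two ideas, assuming string-level tokens for simplicity: religious terms receive a higher masking probability than ordinary tokens, while a protected sacred vocabulary is never masked. The term lists, probabilities, and function name are illustrative assumptions, not ALBERTIR's actual code.

```python
# Sketch (assumed, not the authors' implementation) of weighted MLM masking
# with sacred term preservation for Indonesian religious text.
import random

# Hypothetical word lists for illustration only.
RELIGIOUS_TERMS = {"iman", "taqwa", "shalat", "zakat"}   # masked more often
SACRED_TERMS = {"allah", "al-qur'an", "rasulullah"}       # never masked

def weighted_mlm_mask(tokens, base_p=0.15, religious_p=0.30, mask_token="[MASK]"):
    """Return (masked_tokens, labels); -100 marks positions ignored by the MLM loss."""
    masked, labels = [], []
    for tok in tokens:
        low = tok.lower()
        if low in SACRED_TERMS:
            # Sacred term preservation: keep the token intact and exclude it from the loss.
            masked.append(tok)
            labels.append(-100)
            continue
        # Weighted MLM: domain terms get a higher masking probability.
        p = religious_p if low in RELIGIOUS_TERMS else base_p
        if random.random() < p:
            masked.append(mask_token)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(-100)
    return masked, labels

tokens = "Dirikanlah shalat dan tunaikan zakat karena Allah".split()
print(weighted_mlm_mask(tokens))
```

In a full pretraining pipeline the labels would be token IDs rather than strings, and masked positions would follow BERT's usual 80/10/10 replacement scheme; this sketch only isolates the weighting and preservation logic described in the abstract.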

Keywords:

Bidirectional Encoder Representations from Transformers (BERT), domain adaptation, religious Natural Language Processing (NLP), Qur'an translation, Hadith, Indonesian language


How to Cite

[1] I. Darmawan, H. Elmunsyah, and D. D. Prasetya, “ALBERTIR: A BERT-Based Pretraining for Indonesian Religious Texts Using Qur’an and Hadith Translations”, Eng. Technol. Appl. Sci. Res., vol. 15, no. 5, pp. 28307–28312, Oct. 2025.
