Enhancing Low-Resource Dialectal ASR in Indonesian Using Speech-Transformer Models and Data Augmentation
Received: 17 June 2025 | Revised: 25 July 2025 | Accepted: 14 August 2025 | Online: 6 October 2025
Corresponding author: Suprapto
Abstract
One of the main challenges in speech recognition research is data scarcity, especially for low-resource languages. A common strategy for improving a model's performance is to expand the data space through data augmentation. Data augmentation has proven effective in increasing the amount of training data and in reducing the mismatch between training and testing data. It is also essential for improving the performance of deep neural networks, as it mitigates overfitting and enhances generalization. This study compares the impact of several standard augmentation techniques applied to low-resource dialectal speech (time stretching, pitch shifting, noise addition, and gain) on recognition performance using a Speech-Transformer architecture. The dataset consists of Indonesian dialectal speech. The results indicate average recognition improvements of 57.6%, 57.9%, and 59.3% in terms of Character Error Rate (CER), Word Error Rate (WER), and Sentence Error Rate (SER), respectively, compared to speech recognition without data augmentation.
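For illustration, the sketch below shows a minimal implementation of the four augmentation techniques using librosa and NumPy. The library choice, the file name "utterance.wav", and all parameter ranges (stretch rates, semitone shifts, the 20 dB signal-to-noise ratio, and the gain range in dB) are illustrative assumptions, not the settings used in the study.

import numpy as np
import librosa

def augment(y, sr, rng):
    """Return four augmented variants of a mono waveform y sampled at sr Hz."""
    # Time stretching: change the speaking rate without altering pitch.
    stretched = librosa.effects.time_stretch(y, rate=rng.uniform(0.9, 1.1))
    # Pitch shifting: move the pitch up or down by a few semitones.
    shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=rng.uniform(-2.0, 2.0))
    # Noise addition: mix in Gaussian noise scaled to an assumed 20 dB SNR.
    noise = rng.standard_normal(len(y))
    scale = np.sqrt(np.mean(y**2) / (10 ** (20.0 / 10.0) * np.mean(noise**2)))
    noisy = y + scale * noise
    # Gain: rescale the amplitude by a random factor between -6 and +6 dB.
    gained = y * 10 ** (rng.uniform(-6.0, 6.0) / 20.0)
    return stretched, shifted, noisy, gained

# Hypothetical usage on a 16 kHz mono utterance.
y, sr = librosa.load("utterance.wav", sr=16000)
variants = augment(y, sr, np.random.default_rng(0))

Each variant can be added to the training set alongside the original utterance, which is how augmentation expands the data space described above.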
Keywords:
augmentation, dialectal speech recognition, low-resource
License
Copyright (c) 2025 Sukmawati Nur Endah, Suprapto, Yohanes Suyanto

This work is licensed under a Creative Commons Attribution 4.0 International License.