Enhanced Square Fiducial Marker Recognition under Challenging Visual Environments Using Multi-Scale CNN-Transformer Fusion
Received: 11 June 2025 | Revised: 11 July 2025, 23 July 2025, and 4 August 2025 | Accepted: 11 August 2025 | Online: 2 September 2025
Corresponding author: Liliek Triyono
Abstract
Recent deep learning methods have demonstrated promising results in object recognition in low-light images. However, existing techniques often struggle with distortion and occlusion, and many rely on convolutional neural network (CNN) architectures, which are limited in their ability to capture long-range dependencies. This frequently leads to inadequate recovery of very dark regions in low-light images. This work introduces a novel Transformer-based method for ArUco marker recognition in low-light environments, termed the Extreme ArUco Vision Transformer (XAViT). We present a Transformer–CNN hybrid block that uses mixed attention to capture both global and local information, combining the Transformer's capacity to model long-range dependencies with the CNN's proficiency at extracting fine-grained features, and thereby enabling reliable detection of ArUco markers even under extreme lighting conditions. Additionally, we employ a Swin-Transformer discriminator to selectively enhance different regions of low-light images, alleviating overexposure, underexposure, and noise. Comprehensive experiments show that XAViT achieves 99.16% accuracy, 97.86% recall, 97.95% precision, and a 97.89% F1-score on a realistic low-light dataset, outperforming state-of-the-art CNN and Transformer models. Moreover, its application to additional vision-based tasks underscores its potential for broader use in advanced vision applications.
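The abstract does not spell out the internals of the mixed-attention hybrid block, so the following is only a minimal PyTorch sketch of one plausible wiring: a depthwise-convolutional branch for local detail, a multi-head self-attention branch for global context, and a learned gate fusing the two. Every name here (MixedAttentionBlock, the gate, the projection) is an illustrative assumption, not the published XAViT design.

```python
# Minimal sketch of a Transformer-CNN hybrid block with mixed attention.
# All module and parameter names are illustrative assumptions; the actual
# XAViT architecture is not specified in the abstract.
import torch
import torch.nn as nn

class MixedAttentionBlock(nn.Module):
    """Fuses a convolutional (local) branch with a self-attention (global) branch."""
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        # Local branch: depthwise conv captures fine-grained spatial detail.
        self.local = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim),
            nn.BatchNorm2d(dim),
            nn.GELU(),
        )
        # Global branch: multi-head self-attention over flattened patches
        # models long-range dependencies across the whole feature map.
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Learnable per-channel gate blends the two branches.
        self.gate = nn.Parameter(torch.zeros(1, dim, 1, 1))
        self.proj = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        local = self.local(x)
        # (B, C, H, W) -> (B, H*W, C) token sequence for attention.
        tokens = self.norm(x.flatten(2).transpose(1, 2))
        glob, _ = self.attn(tokens, tokens, tokens, need_weights=False)
        glob = glob.transpose(1, 2).reshape(b, c, h, w)
        # Sigmoid-gated fusion of local detail and global context,
        # added back to the input as a residual.
        g = torch.sigmoid(self.gate)
        return x + self.proj(g * glob + (1 - g) * local)

# Example: a hypothetical 64-channel feature map from a low-light encoder.
feat = torch.randn(2, 64, 32, 32)
out = MixedAttentionBlock(dim=64)(feat)
print(out.shape)  # torch.Size([2, 64, 32, 32])
```

The gated residual fusion is one common way to let the network weight global context against local detail per channel; the paper itself may combine the branches differently.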
Keywords:
low-light image, indoor navigation, computer vision, markers, assistive technology
License
Copyright (c) 2025 Liliek Triyono, Rahmat Gernowo, Prayitno, Eko Harry Pratisto

This work is licensed under a Creative Commons Attribution 4.0 International License.