D-CNN: A New model for Generating Image Captions with Text Extraction Using Deep Learning for Visually Challenged Individuals


  • M. Bhalekar, School of Computer Engineering and Technology, MIT World Peace University, India
  • M. Bedekar, School of Computer Engineering and Technology, MIT World Peace University, India, https://orcid.org/0000-0003-4461-9641


Automatically describing the content of an image in well-formed sentences is a difficult task in any language. However, it can have a significant impact by helping visually challenged individuals better understand their surroundings. This paper proposes an image captioning system that generates detailed captions and, when an image contains text, extracts that text and uses it as part of the caption to describe the image more precisely. The proposed model uses Convolutional Neural Networks (CNNs) to extract image features, followed by a Long Short-Term Memory (LSTM) network that generates the corresponding sentences from the learned features. The text extraction module then adds any extracted text to the image description, and the captions are presented in audio form. Publicly available benchmark datasets for image captioning, such as MS COCO, Flickr-8k, and Flickr-30k, offer a wide variety of images, but few of those images contain textual information. Because these datasets are not sufficient for the proposed model, a new image caption dataset consisting of images with textual content was created. Using this dataset, the experimental results of the proposed model are compared with those of the existing pre-trained model. The results show that the proposed model is as effective as the existing image captioning model and provides more insight into the image by performing text extraction.
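The flow described in the abstract (caption generation, optional text extraction, audio output) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `merge_caption_with_text` and `describe_image`, the wording of the merged sentence, and the callable-based interface are all assumptions made for illustration. The captioner stands in for the CNN and LSTM model, the text extractor for an OCR engine, and `speak` for a text-to-speech backend.

```python
from typing import Any, Callable, Optional


def merge_caption_with_text(caption: str, extracted_text: Optional[str]) -> str:
    """Append text found in the image (if any) to the generated caption.

    Hypothetical helper: the phrasing of the merged sentence is an assumption.
    """
    caption = caption.strip().rstrip(".")
    # OCR output often contains stray newlines and spaces; collapse them.
    text = " ".join((extracted_text or "").split())
    if not text:
        return f"{caption}."
    return f'{caption}. The text in the image reads: "{text}".'


def describe_image(
    image: Any,
    captioner: Callable[[Any], str],
    text_extractor: Callable[[Any], Optional[str]],
    speak: Optional[Callable[[str], None]] = None,
) -> str:
    """Run the three stages in order and return the final description.

    captioner      -- stand-in for the CNN + LSTM caption generator
    text_extractor -- stand-in for the text extraction (OCR) module
    speak          -- optional stand-in for the audio output stage
    """
    caption = captioner(image)
    extracted = text_extractor(image)
    description = merge_caption_with_text(caption, extracted)
    if speak is not None:
        speak(description)
    return description
```

Passing the stages in as callables keeps the sketch independent of any particular captioning model, OCR engine, or speech library; each can be swapped without changing the surrounding logic.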


image captioning, text extraction, convolutional model, long short-term memory, deep learning






How to Cite

M. Bhalekar and M. Bedekar, “D-CNN: A New model for Generating Image Captions with Text Extraction Using Deep Learning for Visually Challenged Individuals”, Eng. Technol. Appl. Sci. Res., vol. 12, no. 2, pp. 8366–8373, Apr. 2022.

