Improving the Recognition Performance of Lip Reading Using the Concatenated Three Sequence Keyframe Image Technique

L. Poomhiran; P. Meesad; S. Nuanmeesri

doi:10.48084/etasr.4102

Authors

L. Poomhiran Faculty of Information Technology and Digital Innovation, King Mongkut’s University of Technology North Bangkok, Thailand https://orcid.org/0000-0003-2658-7973
P. Meesad Faculty of Information Technology and Digital Innovation, King Mongkut’s University of Technology North Bangkok, Thailand
S. Nuanmeesri Faculty of Science and Technology, Suan Sunandha Rajabhat University, Thailand https://orcid.org/0000-0002-2511-9820

Volume: 11 | Issue: 2 | Pages: 6986-6992 | April 2021 | https://doi.org/10.48084/etasr.4102

Received: 17 February 2021 | Revised: 5 March 2021 | Accepted: 9 March 2021 | Online: 11 April 2021

Corresponding author: L. Poomhiran

Abstract

This paper proposes a lip reading method based on convolutional neural networks applied to the Concatenated Three Sequence Keyframe Image (C3-SKI), which consists of (a) the Start-Lip Image (SLI), (b) the Middle-Lip Image (MLI), and (c) the End-Lip Image (ELI), which marks the end of the pronunciation of the syllable. The lip area was reduced to 32×32 pixels per image frame, and three keyframes were concatenated to represent one syllable as a 96×32-pixel image for visual speech recognition. The three keyframes representing a syllable are selected based on the relative maxima and minima of the open lip’s width and height. On the THDigits dataset, the model achieved an accuracy of 95.06%, a validation accuracy of 86.03%, a loss of 4.61%, and a validation loss of 9.04%. The C3-SKI technique was also applied to the AVDigits dataset, where it reached 85.62% accuracy. In conclusion, the C3-SKI technique can be applied to lip reading recognition.
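The concatenation step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the three 32×32 grayscale keyframes are joined side by side, which yields an image 96 pixels wide and 32 pixels high; the abstract does not state the concatenation axis. The function names `concat_keyframes` and `pick_keyframes` are hypothetical, and the selection rule shown (first frame, interior frame with the largest lip opening, last frame) is a simplified stand-in for the relative maximum and minimum criterion, whose details the abstract does not give.

```python
def concat_keyframes(start, middle, end):
    """Join three equally sized grayscale frames side by side, row by row."""
    frames = (start, middle, end)
    height = len(start)
    if any(len(f) != height for f in frames):
        raise ValueError("all keyframes must have the same height")
    return [start[r] + middle[r] + end[r] for r in range(height)]


def pick_keyframes(lip_heights):
    """Hypothetical rule (an assumption, not the paper's exact criterion):
    first frame, interior frame with the largest lip opening, last frame."""
    if len(lip_heights) < 3:
        raise ValueError("need at least three frames")
    interior = range(1, len(lip_heights) - 1)
    middle = max(interior, key=lambda i: lip_heights[i])
    return 0, middle, len(lip_heights) - 1


SIZE = 32
# Dummy constant-intensity frames; real input would be cropped lip images.
sli = [[0] * SIZE for _ in range(SIZE)]    # Start-Lip Image
mli = [[128] * SIZE for _ in range(SIZE)]  # Middle-Lip Image
eli = [[255] * SIZE for _ in range(SIZE)]  # End-Lip Image

c3ski = concat_keyframes(sli, mli, eli)
print(len(c3ski[0]), len(c3ski))  # width and height: 96 32
```

In this sketch, `pick_keyframes` would be called on a per-frame lip opening measurement for one syllable, and the three returned indices would select the frames passed to `concat_keyframes`.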

Keywords:

concatenated frame images, convolutional neural network, keyframe reduction, keyframe sequence, lip reading
