Dynamic Sign Language Recognition in Bahasa using MediaPipe, Long Short-Term Memory, and Convolutional Neural Network

March 28, 2025

Background: Communication is important for everyone, including individuals with hearing and speech impairments. For this group, sign language is the primary medium of communication, both with others who share similar conditions and with hearing individuals who understand it. Communication difficulties arise, however, when these individuals attempt to interact with people who do not understand sign language.

Objective: This research aims to develop models capable of recognizing sign language movements in Bahasa and converting the detected gestures into corresponding words, with a focus on vocabulary related to religious activities. Specifically, the research examined dynamic sign language in Bahasa, which comprises gestures that require motion to be demonstrated properly.

Methods: In line with the research objective, a sign language recognition model was developed using a MediaPipe-assisted landmark extraction process. Recognition of dynamic sign language in Bahasa was achieved by applying Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) methods.
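
For illustration, the sketch below shows one plausible MediaPipe-assisted extraction pipeline using the Holistic solution; the choice of solution, the hand-landmark subset, and the zero-filling strategy for undetected hands are assumptions, as the abstract does not specify them.

```python
# A minimal sketch of MediaPipe-assisted landmark extraction (assumed setup).
import cv2
import numpy as np
import mediapipe as mp

mp_holistic = mp.solutions.holistic

def extract_landmarks(video_path: str) -> np.ndarray:
    """Return an array of shape (num_frames, num_features) of hand landmark coordinates."""
    frames = []
    with mp_holistic.Holistic(min_detection_confidence=0.5,
                              min_tracking_confidence=0.5) as holistic:
        cap = cv2.VideoCapture(video_path)
        while cap.isOpened():
            ok, frame = cap.read()
            if not ok:
                break
            # MediaPipe expects RGB input; OpenCV decodes frames as BGR.
            results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            coords = []
            for hand in (results.left_hand_landmarks, results.right_hand_landmarks):
                if hand is not None:
                    coords.extend(v for lm in hand.landmark for v in (lm.x, lm.y, lm.z))
                else:
                    # 21 landmarks per hand, 3 coordinates each; zero-fill when undetected.
                    coords.extend([0.0] * 21 * 3)
            frames.append(coords)
        cap.release()
    return np.asarray(frames, dtype=np.float32)
```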

Results: The sign language recognition model built with a bidirectional LSTM showed the best result, with a testing accuracy of 100%, whereas the best result for the CNN alone was 86.67%. Combining CNN and LSTM improved performance over the CNN alone, with the best CNN-LSTM model achieving an accuracy of 95.24%.
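
As a hedged illustration of the best-performing architecture, the Keras sketch below stacks two bidirectional LSTM layers over landmark sequences; the sequence length, feature count, layer widths, and vocabulary size are placeholders, not the paper's reported configuration. A CNN-LSTM variant could prepend Conv1D layers to this stack, and a CNN-only baseline would drop the LSTM layers entirely.

```python
# A minimal sketch of a bidirectional LSTM classifier over landmark sequences (assumed shapes).
import tensorflow as tf

NUM_FRAMES, NUM_FEATURES, NUM_WORDS = 30, 126, 10  # illustrative values only

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(NUM_FRAMES, NUM_FEATURES)),
    # Bidirectional layers read the sequence forward and backward,
    # so each time step sees both past and future context.
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(NUM_WORDS, activation="softmax"),  # one class per word
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```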

Conclusion: The bidirectional LSTM model outperformed the unidirectional LSTM by capturing richer temporal information, as it considers both past and future time steps. The CNN alone could not match the effectiveness of the bidirectional LSTM, but combining the CNN with an LSTM produced better results. Normalizing the landmark data was also found to significantly improve accuracy, which was further influenced by shot-type variability and by the landmark coordinates used: the dataset of straight-shot videos produced more accurate results with x and y coordinates alone, whereas datasets containing shot variation typically required x, y, and z coordinates for optimal accuracy.
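
The abstract does not give the normalization formula, but a common choice for landmark data is per-frame centering and scaling, sketched below as an assumption rather than the authors' exact method.

```python
# A minimal sketch of one plausible landmark normalization (assumed, not the paper's formula).
import numpy as np

def normalize_frame(coords: np.ndarray) -> np.ndarray:
    """Center one frame's landmarks on their mean and scale to unit spread.

    coords has shape (num_landmarks, 3), holding x, y, z in MediaPipe's
    normalized image coordinates.
    """
    centered = coords - coords.mean(axis=0)          # remove position of the hands in the frame
    scale = np.linalg.norm(centered, axis=1).max()   # largest distance from the center
    return centered / scale if scale > 0 else centered
```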

Keywords: Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), MediaPipe, Sign Language