Multiple Sign Language Identification Using Deep Learning Techniques

The research presents a general overview of sign languages, along with a survey of prior work covering all aspects of sign languages, including the tools used to collect sign language data and the algorithms that achieve the best results. A specialized dataset is prepared that combines the alphabet signs of the Arabic, American, and British sign languages, as they are among the most important and most widespread sign languages in the world. Deep learning techniques such as LeNet, VGG-16, and CapsNet are then applied to identify signs from these different sign languages.


Introduction
Language has two different modalities: natural language and sign language [1]. Sign language (SL) is a second language for human communication and interaction, especially for deaf people. SL is a visual means of communication that uses gestures and expressions: linguistic content is conveyed through hand movements, facial expressions, lip articulation, and body movements. Fingerspelling is the main method of translating words using hand movements. According to the World Health Organization (WHO) [2], 360 million people are deaf, using about 300 sign languages across different countries. This number was expected to grow to 466 million by 2020 and to exceed 900 million by 2050.
Each country has its own sign language, such as American Sign Language (ASL), British Sign Language (BSL), and Arabic Sign Language (ArSL). Unlike spoken language, SL is not global: there is still no universal sign language unified across the world to facilitate interaction. An effort to generalize sign language across all nations was made in cooperation between the World Federation of the Deaf (WFD) and the British Deaf Association (BDA), resulting in a language called "Gestuno".
Gestuno is a universal sign language composed of signs from various countries such as Russia, Great Britain, the United States, and Italy. Unfortunately, Gestuno cannot be generalized for global interaction for several reasons: there are no fluent signers or experts, and it has no defined grammar for teaching; consequently, it cannot be used by children or ordinary people. One motivation of our research is to use ArSL, which was excluded from previous literature, and to identify it among different SLs, as the number of signers in Arab countries is higher than in North America and Europe [3]. To the best of our knowledge, previous research has focused on ASL, BSL, GSL, and FSL [4] and excluded other SLs. Another motivation is that models used to identify or detect two or more SLs have mostly been built with traditional machine learning, with rare use of deep learning techniques.
Although many Sign Language Identification (SLI) models have been developed, none of them can recognize multiple sign languages. At the same time, in recent decades, a reliable system that can interact and communicate with people from different nations using different sign languages has become a great necessity. We therefore need to identify and recognize signs from multiple SLs at the same time, since excluding deaf people and discarding their attendance harms the whole work progress and their psyche, which emphasizes the principle of "nothing about us without us".
Various artificial intelligence and machine learning techniques have been applied to many sign language benchmarks, with both high and low accuracies [5]. Deep learning has proven how accurate its results can be, and CNNs have produced outstanding results in pattern recognition [6] and image classification [7]. Our research therefore applies two CNN models, VGG16 and LeNet, in addition to CapsNet, since position and rotation are important features in our problem. This research focuses on using deep learning models to facilitate recognizing and identifying different signs from different languages such as ArSL, BSL, and ASL.
One of the main contributions of our research is the benchmark dataset, which was generated from different videos collected from YouTube. The dataset includes three sign languages: ArSL, ASL, and BSL. The variety of instructors and environments makes our dataset a comprehensive one. Together, these three datasets provide a huge number of images, contributing to the high accuracies we discuss in the upcoming sections. The reason for creating this dataset is the lack of alphabet datasets for different SLs and the need for a large number of images describing each sign in any of the three sign languages.
Deep learning methods were chosen because of the abundance of images in our large dataset. VGG16 was applied due to its high accuracy, but it does not respond correctly to images with variance in illumination, lighting, and rotation. CapsNet is an alternative to CNNs [8] and was applied to overcome these issues: it keeps valuable information (hand shape, pose, and location) by excluding max pooling layers, it encodes instantiation parameters while keeping the relationships between them, and it applies dynamic routing between capsules by agreement [9].
The main target of our research is to recognize the alphabet characters of more than one sign language from static hand gestures. Figure 1 shows an example: the ArSL sign for the character "Baa".
The remainder of the paper is organized as follows: Section 2 reviews related works and the drawbacks to be solved. Section 3 describes the preprocessing steps applied to the image datasets presented in Section 4. Section 5 discusses the proposed model. Finally, Section 6 presents the results, discussion, and conclusions.

Related Works
Deep Learning (DL) [10] has been widely used in recent years due to its great results and the limitations of other Machine Learning (ML) algorithms, which we list in the upcoming sections.
In this section we discuss some state-of-the-art works to investigate and highlight any research gap. We collected related works from the last 10 years that use the models applied here, such as LeNet, VGG16, and CapsNet, on different SL datasets. Deep learning models do not require as many preprocessing and feature extraction steps as traditional machine learning techniques such as HMM, SVM, and KNN classifiers [11].
According to A. Sultan et al., the detection and recognition of different sign languages is based on three main system types. The first is the glove-based system, which contains built-in sensors to capture motion. The second is the vision-based system, which depends mainly on images captured from digital cameras; it is of course much cheaper, and the boom of deep learning makes it even more attractive. The final one is based on virtual buttons.
A CNN architecture called a dense model [7] was applied to the authors' own static hand gesture dataset, reaching an accuracy of 90.3%; their alphabet ASL dataset contains more than 50,000 images collected under the same lighting and background conditions. VGGNet, a deep neural network [11], was applied to multiple sign languages: ASL, ISL (Irish Sign Language), and ArASL (Arabic Alphabets Sign Language). An accuracy of 99% was achieved for ASL and ISL, but 98% for ArASL; the higher accuracies stem from ASL and ISL having fewer classes than ArASL. An enhancement to the VGG16 model was proposed [12] by adding two dense layers, on a dataset of 5,391 images of the 26 English alphabet characters, and measuring the influence of the distance between the recognition area and the screen; the result was that 20 to 40 cm is the best distance for recognition, which should not exceed 80 cm. The training accuracy was 99.902% and the testing accuracy 99.910%.
Several CNN models [13], such as VGG16, ResNet, EfficientNet, and AlexNet, were applied to recognize real-time Arabic sign language alphabets; AlexNet was the best, with a training accuracy of 99.75% and a test accuracy of 94.81%. T. N. Abu-Jamie and S. S. Abu-Naser used a Kaggle dataset including the alphabets from A to Z plus space, delete, and nothing to predict ASL characters; after 20 epochs of training on a pre-trained VGG-16 model, the final accuracy was 100%.
LeNet-5 [15] was implemented to predict 1,500 images for each digit from 0 to 9, augmented to produce 3,000 images per sign, captured with a webcam against a simple, plain background. Image processing steps such as color conversion, blurring, and sharpening were applied to remove noise. The total accuracy was 99.8%, with 90% validation accuracy.
The MNIST Kaggle sign language dataset of ASL characters was used in [16], excluding Z and J since those signs require motion. LeNet and CapsNet models were used to recognize the signs: LeNet achieved an overall accuracy of 82%, while CapsNet was applied to two versions of the dataset, the original MNIST dataset and an augmented version, producing accuracies of 88% and 95% respectively. A CapsNet model was proposed [17] instead of traditional CNN models; it recognizes American digits from 0 to 9 and alphabets from A to Z (excluding J and Z, which require motion capture), giving a testing accuracy of 99.52% for 100×100 RGB input images and 99.94% for 32×32 RGB images on the sign language digit dataset, and a test accuracy of 99.60% on 28×28 grey images of the MNIST dataset. K. Suri and R. Gupta proposed a deep learning capsule model for predicting Indian sign language from signals received from a wearable IMU device, obtaining training accuracies of 99.72% and 99.56% for 3 and 5 routing iterations respectively.

Preprocessing Steps
Preprocessing is a very important step when data is unclear, noisy, or incompatible; as is well known, "garbage in, garbage out". Our dataset is almost clean and gives good results as is, but we applied some preprocessing steps to obtain higher accuracy.
Hand segmentation is a very important step, since signing SL alphabets requires hand movements only: the image pixels that comprise the hands are identified and output as a mask to our proposed model. Many papers have performed hand segmentation with different methodologies, some based on color spaces and others on machine learning models [18].
Videos collected from YouTube were edited and enhanced using the Camtasia editor, and the most important region of interest (ROI), which includes the human face and hands, was segmented. No further image enhancement or segmentation algorithms were needed.
We resized the images to 64×64 pixels, created an index label for each class, and applied image labeling using the LabelBinarizer class from the sklearn library. The images and labels were then shuffled together, followed by image normalization.
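A minimal sketch of these preprocessing steps, assuming the images have already been gathered into per-class folders; the `image_paths` and `class_names` lists are illustrative placeholders, not names from the paper:

```python
import cv2
import numpy as np
from sklearn.preprocessing import LabelBinarizer
from sklearn.utils import shuffle

# Assumed inputs: parallel lists of image file paths and their class names,
# built while walking the dataset folders.
images, labels = [], []
for path, name in zip(image_paths, class_names):
    img = cv2.imread(path)
    img = cv2.resize(img, (64, 64))            # resize to 64x64 pixels
    images.append(img)
    labels.append(name)

lb = LabelBinarizer()
y = lb.fit_transform(labels)                    # one-hot label per class index
X = np.array(images, dtype='float32') / 255.0   # normalize pixel values to [0, 1]
X, y = shuffle(X, y, random_state=42)           # shuffle images and labels together
```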

Dataset and Environment Initialization
Getting a dataset compatible with our problem was a great challenge. We crawled the web for signs from different sources, focusing on YouTube as the main source. The dataset has three categories of alphabets: ASL (American Sign Language), BSL (British Sign Language), and ArSL (Arabic Sign Language). It was collected from specialized people who are interested in teaching and guiding others in how to communicate with deaf people; these specialists are certified instructors in different learning centers.
As shown in Table 1, the ASL set consists of 26 letters collected from 14 signers, resulting in 41,959 images. The BSL set consists of 61,120 images collected from professional instructors and trainers on YouTube, representing 26 alphabet characters. The ArSL set contains 38,483 images demonstrated by 16 signers and trainers, separated into 29 classes. Table 2 shows the number of images for each alphabet character in our benchmark.

Environment Initialization
Experiments were run on a Dell Precision M4800 laptop with 16 GB of RAM, Windows 10 as the operating system, and an NVIDIA Quadro K1100M GPU with 2 GB of memory. We used it with TensorFlow and Keras to run the LeNet [19] and VGG16 [20] deep learning models, saving a lot of time. We also used Google Colab as a second environment to run the CapsNet model, as the laptop does not have enough GPU capacity to handle it.

Proposed Model
Figure 2 presents the methodology of our work, starting from the initial step of gathering dataset images to the final step of predicting personal images. The figure depicts the overall process in three phases. First come the input images, described previously in the dataset section. Then some image preprocessing steps are applied, such as resizing, shuffling, and image labeling; no further image enhancement algorithms are needed, as will be shown in the results section. The next step is to choose the CNN model best suited to our dataset: we applied a LeNet model from scratch, used transfer learning to apply VGG16, and finally used a CapsNet model. We discuss the reason for choosing each one in the upcoming sections.

Rotational Equivariance
CNN models do not respond correctly to rotations or large transformations: pooling layers tend to lose information, so a CNN cannot classify and recognize rotated images well. A CapsNet model, on the other hand, can recognize a rotated image easily, much like the human brain. The idea behind CapsNet is illustrated by the following example. The network captures an image and breaks it into individual features, such as the fingers of a hand; see Figure 3a (ASL character "B"). What happens if the image is rotated by some angle, say 30°, or turned upside down? A CNN would need new features, as shown in Figure 3b, so this approach amounts to a brute-force search over all possible rotation angles. CapsNet handles this easily through a property called "rotational equivariance"; the same idea extends to scaling, skewing, and thickness, so transformed images can be recognized easily. Equivariance is implemented in three steps: convolution, reshaping, and the squash function.
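The sketch below illustrates these three steps in Keras; the filter counts, kernel sizes, and capsule dimension are illustrative assumptions, since the paper does not list them here:

```python
import tensorflow as tf

def squash(s, axis=-1, eps=1e-8):
    # Shrinks each vector's length into [0, 1) while preserving its
    # orientation, so the length can act as an existence probability.
    sq_norm = tf.reduce_sum(tf.square(s), axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / tf.sqrt(sq_norm + eps)

# Primary capsules: convolution -> reshape into capsule vectors -> squash.
inputs = tf.keras.Input(shape=(64, 64, 3))
x = tf.keras.layers.Conv2D(256, 9, activation='relu')(inputs)
x = tf.keras.layers.Conv2D(256, 9, strides=2)(x)
caps = tf.keras.layers.Reshape((-1, 8))(x)       # each row is one 8-D capsule
caps = tf.keras.layers.Lambda(squash)(caps)      # vector lengths now in [0, 1)
```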

Dynamic Routing by Agreement
CapsNet applies dynamic routing by agreement [9], which tries to use the minimum set of features required to detect and recognize the hand gesture. If you break out the alphabet "B" in Figure 3a and check its label using only one of its low-level features, such as the middle finger, that is not enough to predict the gesture. We need to iteratively define more complex features to reach the correct label: features that help recognize the label are given more weight, so the correct information is "routed" to the feature detector for classification.
According to [9], the best number of routing iterations is 3, as it gives the minimum loss value and high accuracy; in our implementation it was less than 5. Figure 4 shows the complete architecture of the CapsNet model. Figure 4a shows the architecture starting from the convolution layer, which receives the input image and outputs an array of 18 feature maps. The feature maps are then reshaped into vectors (18 = 2 × 9) representing every location in the image, and each vector is constrained to a length between 0 and 1, because the length represents the probability of existence at that location; this check is called "squashing". The primary capsule layer detects objects, determines which capsule belongs to which object, and whether the object is located in the image; the process is the "routing by agreement" described above. The higher capsule layers find "routing weights": the weighted sum is computed in the first iteration, the output is predicted and compared with the actual one, and these steps are repeated over further iterations until the predicted class is found correctly.
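A compact NumPy sketch of the routing loop described above; the shapes and names are illustrative rather than the paper's exact implementation:

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def dynamic_routing(u_hat, num_iterations=3):
    # u_hat: prediction vectors from lower-level capsules,
    # shape (num_lower, num_higher, dim_higher).
    num_lower, num_higher, _ = u_hat.shape
    b = np.zeros((num_lower, num_higher))          # routing logits
    for _ in range(num_iterations):
        # Coupling coefficients: softmax over the higher-level capsules.
        c = np.exp(b - b.max(axis=1, keepdims=True))
        c /= c.sum(axis=1, keepdims=True)
        s = (c[..., None] * u_hat).sum(axis=0)     # weighted sum per higher capsule
        v = squash(s)                              # outputs, shape (num_higher, dim_higher)
        b += np.einsum('ijk,jk->ij', u_hat, v)     # raise logits where prediction and output agree
    return v
```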
After routing by agreement completes, we need to compute the probability of each class. The original paper [9] used a margin loss function to calculate this probability. The vector length is computed by a layer added after the highest capsule layer, and the squared length is compared against two thresholds, 0.9 and 0.1: if the object exists in the image, the length of its class vector must be at least 0.9, and if it does not exist, the length must be at most 0.1; see Equations (1) and (2).
$$\lVert v_k \rVert \geq 0.9, \quad \text{if an object of class } k \text{ exists} \quad (1)$$

$$\lVert v_k \rVert \leq 0.1, \quad \text{if no object of class } k \text{ exists} \quad (2)$$
where $\lVert v_k \rVert$ is the length of the output vector of the capsule for class $k$. A decoder network (Figure 4b) is added after the highest capsule layer. It consists of three fully connected layers: the first two are activated using Leaky ReLU as a new hyperparameter (the original paper [9] used ReLU), and the last uses a sigmoid function. The decoder reconstructs the input image, and the reconstruction loss is the squared difference between the input image $x$ and the reconstruction $\hat{x}$:

$$\text{Reconstruction Loss} = \lVert x - \hat{x} \rVert^2 \quad (3)$$

The total loss is then calculated as

$$\text{Total Loss} = \text{Margin Loss} + \alpha \cdot \text{Reconstruction Loss} \quad (4)$$

where $\alpha = 0.0005$, as in [9], scales down the reconstruction loss so that it does not dominate training.
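The two loss terms translate directly into code; below is a sketch assuming one-hot labels, with the down-weighting factor λ = 0.5 for absent classes taken from [9]:

```python
import tensorflow as tf

M_PLUS, M_MINUS, LAMBDA, ALPHA = 0.9, 0.1, 0.5, 0.0005

def margin_loss(y_true, v_lengths):
    # y_true: one-hot labels; v_lengths: ||v_k|| for each class capsule.
    present = y_true * tf.square(tf.maximum(0.0, M_PLUS - v_lengths))
    absent = LAMBDA * (1.0 - y_true) * tf.square(tf.maximum(0.0, v_lengths - M_MINUS))
    return tf.reduce_sum(present + absent, axis=-1)

def total_loss(y_true, v_lengths, images, reconstructions):
    # Reconstruction loss: squared error between the input image and the
    # decoder output, scaled by ALPHA as in Equation (4).
    rec = tf.reduce_sum(tf.square(images - reconstructions), axis=[1, 2, 3])
    return margin_loss(y_true, v_lengths) + ALPHA * rec
```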
For the activation functions, we tried three types: ReLU, Swish, and Leaky ReLU. Swish was the worst, while Leaky ReLU gave a promising result. G. B. and S. Natarajan also reported that Leaky ReLU outperforms other activation functions.

CNN Architecture
A Convolutional Neural Network (CNN) is a deep learning architecture that has excelled in various fields such as image processing, Computer Vision (CV), Natural Language Processing (NLP), image classification, object detection, and many others.
Going deep in a CNN is necessary to extract more features from confusable images, and many wide and deep architectures have been built on CNNs to enhance performance. One of them is the LeNet architecture, which basically consists of five layers in its latest version, LeNet-5 [19]. One of our proposed architectures is based on LeNet with some modifications, as shown in Figure 5. It has three convolution layers, three max pooling layers, and two fully connected layers, followed by the output layer, whose size is 26 for ASL and BSL and 29 for ArSL. The feature maps were flattened, and batch normalization was applied three times, before and after each dense layer. Dropout was used with a rate of 0.6 after the first dense layer and 0.7 after the second dense layer.
The Adam optimizer was used with categorical cross-entropy as the loss function, ReLU as the activation function in the dense and convolution layers, and a softmax output layer. The number of epochs and the batch size were varied over three values: 50, 100, and 150.
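A sketch of this modified LeNet in Keras; the filter counts and dense-layer sizes are assumptions, since Figure 5 is not reproduced here, while the layer order, dropout rates, optimizer, and loss follow the description above:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_lenet_variant(num_classes):        # 26 for ASL/BSL, 29 for ArSL
    model = keras.Sequential([
        layers.Input(shape=(64, 64, 3)),
        layers.Conv2D(32, 5, activation='relu'),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 5, activation='relu'),
        layers.MaxPooling2D(),
        layers.Conv2D(128, 3, activation='relu'),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.BatchNormalization(),         # batch norm before the first dense layer
        layers.Dense(256, activation='relu'),
        layers.BatchNormalization(),
        layers.Dropout(0.6),                 # 0.6 after the first dense layer
        layers.Dense(128, activation='relu'),
        layers.BatchNormalization(),
        layers.Dropout(0.7),                 # 0.7 after the second dense layer
        layers.Dense(num_classes, activation='softmax'),
    ])
    model.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model
```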

VGG16 Architecture
VGG is based on the classical CNN architecture; VGG stands for Visual Geometry Group. VGG16 [20] is a pretrained model whose weights can be applied to our own model. It has advantages over other architectures such as AlexNet, GoogLeNet, and ResNet: VGG is a very deep neural network (16 layers) compared to AlexNet's 7 layers, so VGG-16 can extract more features. VGG16 was trained on the ImageNet dataset [22], which has 1,000 classes of different categories, reaching a test accuracy of 92.7%. We used transfer learning based on VGG16 to apply this architecture to our benchmark dataset, as shown in Figure 6.
Our method applies transfer learning using the ImageNet weights. Transfer learning has four main types, shown as four quadrants in Figure 7. We used Q1, as it matches our problem: a large dataset that differs from the pretrained model's dataset (ImageNet).
All layers were frozen; we only changed the size of the fully connected layers to 512, followed by an output layer with a softmax function. Input images to the model are 64×64 pixels. The dataset was divided into training and testing sets with percentages of 80% and 20% respectively, and the training set was then split into training and validation sets of 75% and 25% respectively.
Setting up the hyperparameters was not an easy step, as it required many trials to find the best values. The learning rate is 1e-5, and the activation function in the fully connected layers is ReLU. The output layer has 26 units for ASL and BSL and 29 for ArSL, with softmax as the activation function. RMSProp is used as the optimizer with categorical cross-entropy as the loss function, 25 epochs, and a batch size of 128. The fully connected layers are followed by dropout and batch normalization, with dropout rates of 40% and 30% for the two FC layers respectively. Table 4 summarizes the hyperparameter values used to train the three datasets.
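A sketch of this transfer-learning setup, reusing the X and y arrays from the preprocessing sketch above; it follows the splits and hyperparameters stated in this section, but is our reconstruction rather than the paper's exact script:

```python
from tensorflow import keras
from tensorflow.keras import layers
from sklearn.model_selection import train_test_split

# 80/20 train/test split, then 75/25 train/validation split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25)

base = keras.applications.VGG16(weights='imagenet', include_top=False,
                                input_shape=(64, 64, 3))
base.trainable = False                      # freeze all convolutional layers

num_classes = y.shape[1]                    # 26 for ASL/BSL, 29 for ArSL
model = keras.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(512, activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(0.4),                    # 40% after the first FC layer
    layers.Dense(512, activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(0.3),                    # 30% after the second FC layer
    layers.Dense(num_classes, activation='softmax'),
])
model.compile(optimizer=keras.optimizers.RMSprop(learning_rate=1e-5),
              loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, validation_data=(X_val, y_val),
          epochs=25, batch_size=128)
```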

Results Comparisons
This section compares our models' results with other state-of-the-art work. Table 3 shows the accuracies obtained by applying the LeNet, VGG16, and CapsNet models on our datasets. We notice that VGG-16 achieves the highest accuracy when trained on the BSL dataset; we conclude that the huge number of BSL images causes this highest accuracy among the different datasets. Other works such as K. Suri and R. Gupta achieve higher accuracy than ours, but with the drawback of requiring a hand-worn device. Also, [12] outperforms other state-of-the-art results but used a smaller dataset of 5,391 images, which is not enough for training and testing compared to our dataset size.

Experiments and Results of LeNet
We applied the LeNet model with different numbers of epochs and batch sizes. We noticed that a low number of epochs and a small batch size on ArSL, with its huge number of images, produce a low loss value and a satisfactory accuracy of 94.95%. For BSL, on the other hand, increasing the epochs and batch size increased accuracy for the same number of images: we got 97.45% accuracy and a loss of 0.0468 with epochs and batch size of 100. To further increase accuracy and decrease loss, we used 150 for both the number of epochs and the batch size; for BSL, this setting gave lower results with different image sizes.

Experiments and Results of VGG16
VGG16 was the best model for training and testing our dataset, achieving the highest accuracy of 99.69% on the BSL dataset thanks to its large number of images, with lower loss values than the LeNet model. ArSL gives an accuracy of 99.05%, and ASL produces an accuracy of 98.5%. We also conclude that VGG16's loss values are lower than LeNet's.

Experiments and Results of CapsNet
The CapsNet model was used to predict and classify images, especially images with different transformations. CapsNet's accuracy was very high compared to LeNet, and its loss values were very low compared to both LeNet and VGG16. We achieved a high accuracy of 99.56% (on BSL), then 98.4848% for ArSL and 98.4286% for ASL, concluding that large datasets lead to high accuracies and low loss values.
For predicting the language category, we used samples of each alphabet in each language to form the category of each language, giving 3,737 images of ArSL, 3,827 images of ASL, and 5,568 images of BSL. We trained and tested these images using VGG16 only, as it had the highest accuracy in identifying sign language alphabets, and obtained a training accuracy of 99.99% and 99.9% for testing, with 10 epochs and a batch size of 128. Figure 8 shows the loss and accuracy curves for the training images, and Figure 9 shows the true and predicted images for each sign language.
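A hypothetical inference step for the language-category model, continuing from the VGG16 sketch above; the three-way label order is an assumption:

```python
import numpy as np

lang_names = ['ArSL', 'ASL', 'BSL']          # assumed label order
probs = model.predict(X_test[:1])[0]         # one preprocessed 64x64 image
print(lang_names[int(np.argmax(probs))], float(probs.max()))
```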

Conclusion
In this paper we applied different deep learning models to recognize and identify static hand gestures of ArSL, BSL, and ASL. LeNet was developed from scratch to train and test each of our datasets. Transfer learning was implemented using VGG16, which gives the highest accuracy, especially on the BSL dataset. To overcome image rotation, scaling, and other transformations, we applied a capsule network, a deep learning model based on the concept of routing by agreement. Our datasets perform well on all three models, which confirms that the dataset suits different sign language applications. As future work, a U-Net model could preprocess incoming images so they are identified more efficiently, and real-time sign language images and videos could be recognized and identified using deep learning models.

Figure 2. Proposed workflow of our research.

Figure 3. Alphabet "B" of ASL with its features broken out individually and rotated.

Figure 7. The four quadrants of transfer learning.

Figure 9. Confusion matrix of true and predicted images in ASL, BSL, and ArSL.

Table 2. Number of signs in each dataset.

Table 4. Hyperparameters of the trained models.