Fusion Machine Learning Strategies for Multi-modal Sensor-based Hand Gesture Recognition

Hand gesture recognition has attracted considerable research attention because of its applicability in fields such as sign language expression and human-machine interaction. Many approaches have been deployed to detect and recognize hand gestures, including wearable devices, image information, and combinations of sensors and computer vision. Among these, wearable sensors offer much higher accuracy and are less affected by occlusion, lighting conditions, and complex backgrounds. Existing solutions either use the sensor information separately or base their decisions on conventional threshold comparison, without analyzing the data or applying machine learning algorithms. In this paper, a multi-modal solution is proposed that combines sensors measuring the curvature of the fingers with sensors measuring angular velocity and acceleration. The information provided by the sensors is normalized and analyzed, and various fusion strategies are evaluated. The most suitable algorithm for these sensor-based modalities is then proposed. The proposed system also analyzes the differences between the defined gestures and near-identical hand shapes that occur during normal movement.
Keywords-hand glove; acceleration sensor; hand gesture recognition; flex sensor; human-machine interaction


I. INTRODUCTION
Hand gestures are one of the most natural means of interaction, both between people (e.g. sign language [1][2][3][4][5][6][24]) and in Human-Computer Interaction (HCI) [7][8][9]. Conveying messages in sign language through hand gestures has attracted the attention of many researchers and technology developers. As a consequence, many hand gesture recognition approaches have been proposed, such as detecting changes in radar waves when the user changes the state of the hand [10][11][25], utilizing image information [12][13][14], or using sensors physically attached to the hand [15][16][17]. The radar-based method [10] is convenient for end-users because the equipment is separate from the user, but it is strongly affected by environmental noise and by the number of antennas [27]. Many moving objects in a normal environment, such as other people or other body parts, can also change the reflected waves. Image-based hand gesture recognition systems [12] often face challenges such as the high time cost of hand detection and recognition and the effects of illumination, occlusion, and complex backgrounds. The hand-mounted sensor-based method [15], although limited by its dependence on a device, has much higher accuracy than the other methods and meets the requirements of control systems that demand high accuracy. Thus, this solution is preferred in many special cases.
Hand-mounted sensors have been applied to measure changes in hand shape and hand movement, and they are efficient because of their accuracy and real-time response. Many solutions have been proposed, such as electronic gloves [17][18]. These gloves integrate flex sensors mounted on the fingers to capture changes in finger curvature. However, this solution measures only the flexure of the fingers and cannot capture the movement of the hand. In [19][20], the authors attached sensors to the hand to measure angular velocity and acceleration. This method captures hand movement, but it cannot capture changes in hand shape. In [21][22][23], the authors combined both curvature and motion sensors, but gestures were classified with conventional comparison structures built from simple microcontroller instructions (e.g. AVR or 8051 compare and conditional-check instructions), without evaluating how the sensor parameters vary and without machine learning algorithms. No quantitative evaluation was carried out to compare the defined gestures with normal human activities.
In this paper, a new solution is proposed that combines both hand shape and hand movement sensors with various feature fusion strategies. The obtained recognition rates reach 99.87% on the stable hand gesture dataset and 97.59% when both mobile and immobile hand gestures are considered. These results are higher than the 97.4% and 86.3% reported in [28] for seen and unseen users on 12 gestures, where finger curvature data were fed to a convolutional neural network. Published hand gesture datasets based on mounted sensors are scarce, and the available ones do not provide both finger curvature and hand position at the same time. Therefore, in this research, we collected two new hand gesture datasets, one with a stationary hand and one with a moving hand. The data streams are fed into classifiers with various fusion strategies to find the most suitable classification solution.

II. PROPOSED METHOD
Our proposed framework for hand gesture recognition is illustrated in Figure 1. In this research, multiple modalities of the hand ($l \in [1, L]$, $L = 3$, corresponding to $l \in \{A, G, F\}$) are captured by two streams:
• Finger curvature: five flex sensors are used to capture the curvature variation of the fingers.
• Hand motion tracking and angle: the MPU6050 sensor integrates a 3-axis MEMS gyroscope and a 3-axis MEMS accelerometer.
All features are processed and synchronized by an MCU (Microcontroller Unit) and transferred to a PC. Finally, various classification strategies are applied, using either a single modality or multiple modalities. The successive steps of the proposed framework are presented in detail in the following sections.

A. Hardware Design for Multiple features of the Hand Glove
The detailed hardware design of the electronic glove is illustrated in Figure 2. Five flex sensors are permanently mounted along the fingers of the glove to measure curvature through the change in the corresponding resistance. Each resistance is first converted into a voltage and then digitized by a 10-bit ADC (Analog-to-Digital Converter), giving values from 0 to 1023. The flex sensor values from the thumb to the little finger are denoted by $F_1$ to $F_5$, written $(F_1, F_2, F_3, F_4, F_5)$ or F(1,2,3,4,5), and correspond to the curvature data of the five fingers. At the same time, the values of the 3-axis MEMS accelerometer and the 3-axis MEMS gyroscope of the MPU6050 sensor are collected. Their ADC resolution is 16 bits, with a range from 0 to 65535, and the values are denoted by $(A_x, A_y, A_z)$ or A(x,y,z) and $(G_x, G_y, G_z)$ or G(x,y,z). After the MCU collects the two data streams (flex sensors and MPU sensor) at the same time, all data are packaged and sent to the PC via the USB port according to the UART standard.
Fig. 2. Hardware design of the hand glove.
In addition, at each time step, the hand gesture information is carried by a feature vector composed of eleven elements in total: A(x,y,z), G(x,y,z), and F(1,2,3,4,5), in that order. This feature vector is given in (1) and is used by the various strategies described in the next section:
$$[A_x, A_y, A_z, G_x, G_y, G_z, F_1, F_2, F_3, F_4, F_5] \quad (1)$$
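As a minimal sketch of how one sample could be assembled on the PC side, the following Python snippet builds the eleven-element vector in the order A, G, F and checks each reading against the ADC ranges stated above. The function name `make_feature_vector` and the example readings are hypothetical, and the layout of the UART packet is not specified here, so parsing is left out.

```python
from typing import List, Sequence

FLEX_MAX = 1023    # 10-bit ADC range for the flex sensors (0 to 1023)
MPU_MAX = 65535    # 16-bit range stated for the MPU6050 channels (0 to 65535)


def make_feature_vector(acc: Sequence[int], gyro: Sequence[int],
                        flex: Sequence[int]) -> List[int]:
    """Return [Ax, Ay, Az, Gx, Gy, Gz, F1, ..., F5] after range checks."""
    if len(acc) != 3 or len(gyro) != 3 or len(flex) != 5:
        raise ValueError("expected 3 accelerometer, 3 gyroscope and 5 flex values")
    for value in list(acc) + list(gyro):
        if not 0 <= value <= MPU_MAX:
            raise ValueError(f"MPU6050 reading out of range: {value}")
    for value in flex:
        if not 0 <= value <= FLEX_MAX:
            raise ValueError(f"flex reading out of range: {value}")
    return list(acc) + list(gyro) + list(flex)


# Hypothetical readings, for illustration only.
sample = make_feature_vector([32768, 100, 65535],
                             [0, 40000, 12],
                             [512, 0, 1023, 300, 700])
```

Rejecting out-of-range values at this stage is a design choice of the sketch; it keeps corrupted packets from reaching the normalization and classification steps.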

B. Hand Gesture Dataset
In this paper, twelve hand postures are used, as illustrated in Figure 3. Each hand gesture is captured both when the end-user is immobile and when the end-user is moving. In practice, the system must recognize hand gestures at any time and place, so hand shapes that occur during normal human activities may have characteristics similar to those of the defined gestures. In addition, the hand shape feature is based only on the curvature of the fingers collected by the five flex sensors, whereas the accelerometer and gyroscope provide useful information about the movement of the hand and its direction.
Figure 4 compares two cases: (1) the end-user forms a hand shape while immobile, and (2) the end-user forms the same hand shape while the hand and body are moving (green, yellow, red, and orange colors in Figure 4). It is clear that, for the same hand gesture, feature A(x,y,z) and feature G(x,y,z) reflect the movement of the hand. They can effectively separate immobile hands from moving hands, but they carry little information for distinguishing among stable gestures. In practice, users are usually immobile when they want to control a device and often hold their hands in front of their body. Meanwhile, the same hand shapes occur in both stable and moving situations. Therefore, we collected the 12 gestures shown in Figure 3 in two conditions: with the hand immobile in front of the user's face, and during normal movement. The two conditions are labeled separately as $Ges_k^1$ (gesture of class $k$, hand stable in front of the body) and $Ges_k^2$ (gesture of class $k$, hand at any position and in movement). We divided the data into two datasets named HandGlove1 and HandGlove2. HandGlove1 is composed of $[Ges_k^1]$ ($k \in [1, K]$, $K = 12$) and HandGlove2 of $[Ges_k^1; Ges_k^2]$ ($k \in [1, K]$, $K = 12$). This means that HandGlove2 has twice as many categories, 24 classes in total. Each gesture was collected three times, and each recording consists of about 200 samples.
A total of 15 adults ($User_j$, $j \in [1, E]$, $E = 15$), including 7 females and 8 males, were invited to contribute to our dataset on different days and at various times of the day. These datasets were then normalized and passed to the classifiers.
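The labelling of the two datasets can be sketched as follows. The helper `class_label` is a hypothetical name; the sketch assumes that stationary gestures keep labels 1 to 12 and that their moving counterparts receive labels 13 to 24, which matches the labelling used in the experiments on HandGlove2.

```python
K = 12  # number of hand shapes defined in Figure 3


def class_label(k: int, moving: bool) -> int:
    """Map hand shape k (1..K) and the motion state to a class label."""
    if not 1 <= k <= K:
        raise ValueError(f"hand shape index must be in 1..{K}, got {k}")
    # Moving gestures are shifted by K so that shape k maps to label k + 12.
    return k + K if moving else k


# HandGlove1 uses only the stationary classes; HandGlove2 uses both groups.
handglove1_labels = [class_label(k, moving=False) for k in range(1, K + 1)]
handglove2_labels = handglove1_labels + [
    class_label(k, moving=True) for k in range(1, K + 1)
]
```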
C. Data Processing
As described above, the data collected from the flex sensors are denoted by F(1,2,3
As a result, the feature vector could be presented by:
Figure 6 illustrates that, when only the curvature feature from the flex sensors is used, the gestures are less clearly separated than when it is combined with the motion features. Adding the 6 motion features allows the gesture types to be separated into more distinct spatial domains. This observation is qualitative rather than quantitative. The normalized features are fed into the classifiers as single modalities and as combined features, as described in Section II.D.
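A minimal sketch of the normalization step is given below. It assumes simple min-max scaling by the full-scale ADC ranges of Section II.A (1023 for the flex sensors, 65535 for the MPU6050 channels); the normalization actually used for the reported results may differ, and the name `normalize_sample` is hypothetical.

```python
from typing import List, Sequence, Tuple

FLEX_MAX = 1023.0   # full scale of the 10-bit flex sensor ADC
MPU_MAX = 65535.0   # full scale stated for the 16-bit MPU6050 channels


def normalize_sample(
    raw: Sequence[float],
) -> Tuple[List[float], List[float], List[float]]:
    """Split an 11-element sample [A, G, F] and scale each modality to [0, 1]."""
    if len(raw) != 11:
        raise ValueError("expected 11 values: A(x,y,z), G(x,y,z), F(1..5)")
    acc = [v / MPU_MAX for v in raw[0:3]]
    gyro = [v / MPU_MAX for v in raw[3:6]]
    flex = [v / FLEX_MAX for v in raw[6:11]]
    return acc, gyro, flex


# Hypothetical raw sample, for illustration only.
acc_n, gyro_n, flex_n = normalize_sample(
    [65535, 0, 13107, 0, 65535, 0, 1023, 0, 0, 0, 1023]
)
```

Keeping the three modalities as separate vectors after scaling makes it straightforward to feed them to the classifiers either individually or in combination.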

D. Hand Gesture Classification
The multi-modal data streams F, A, and G presented above are combined using different fusion techniques. In this research, three fusion techniques are investigated: early fusion, late fusion, and Multiple Kernel Learning (MKL) [2]. In the following, we briefly describe these methods, which are used in the classification block of Figure 1.

1) Early Fusion
For early fusion, the normalized feature vectors of the 3 modalities are concatenated into a single feature vector, as presented in (5):
Then, the feature vector $F$ is used as the input of a final multi-class SVM classifier to predict the hand gesture label.
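The concatenation step of early fusion can be sketched as follows. The function `early_fusion`, its `use` parameter, and the example values are hypothetical; the `use` argument is included only so that the same helper can build modality subsets such as A and F. The fused vector would then be passed to the SVM classifier, which is not shown.

```python
from typing import Dict, List, Sequence


def early_fusion(sample: Dict[str, Sequence[float]],
                 use: Sequence[str] = ("A", "G", "F")) -> List[float]:
    """Concatenate the normalized vectors of the selected modalities."""
    fused: List[float] = []
    for name in use:
        fused.extend(sample[name])  # KeyError if a modality is missing
    return fused


# Hypothetical normalized sample, for illustration only.
sample = {
    "A": [0.1, 0.2, 0.3],
    "G": [0.4, 0.5, 0.6],
    "F": [0.9, 0.8, 0.7, 0.6, 0.5],
}
fused_all = early_fusion(sample)                 # 11 elements
fused_af = early_fusion(sample, use=("A", "F"))  # 8 elements
```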
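
2) Late Fusion
For late fusion, each modality is classified separately, which yields one score vector per modality. Next, the maximum operator is applied to these score vectors $S$ to obtain the final score vector, as presented in (6):
Then, the final decision is obtained as illustrated in (7).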
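The max-based combination of score vectors can be sketched as below. The sketch assumes that each per-modality classifier outputs one score per class and that the final decision is the class with the largest fused score; the function name `late_fusion` and the example scores are hypothetical.

```python
from typing import List, Sequence, Tuple


def late_fusion(score_vectors: Sequence[Sequence[float]]) -> Tuple[List[float], int]:
    """Fuse per-modality score vectors with an element-wise maximum.

    Returns the fused score vector and the 1-based label of its largest entry.
    """
    if not score_vectors:
        raise ValueError("at least one score vector is required")
    n_classes = len(score_vectors[0])
    if any(len(s) != n_classes for s in score_vectors):
        raise ValueError("all score vectors must have the same length")
    final = [max(s[k] for s in score_vectors) for k in range(n_classes)]
    label = final.index(max(final)) + 1  # assumed decision rule: arg max
    return final, label


# Hypothetical scores for three classes from the A, G and F classifiers.
scores_a = [0.2, 0.5, 0.3]
scores_g = [0.1, 0.3, 0.6]
scores_f = [0.7, 0.2, 0.1]
final_scores, predicted = late_fusion([scores_a, scores_g, scores_f])
```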

3) Multiple Kernel Learning
MKL is an algorithm that combines a set of base kernels, which can represent different similarity measures for different sources of data. Each feature vector from the $l^{th}$ ($l \in [1, L]$) modality uses a kernel function to compute the corresponding kernel matrix, as shown in (8):
where $v \in [1, V]$ ($V = 3$). Then, the kernel matrices $Kernel_v$ are combined into a single matrix $Kernel$, as presented in (9):
where $func$ denotes the form of the combination function. The combination coefficients $\mu$ are either fixed by predefined rules or optimized by the MKL learning process [1,2]. In this research, EasyMKL [1], a binary margin-maximization MKL algorithm that uses a convex summation function, is chosen. The coefficients $\mu$ are restricted to be non-negative and to sum to 1. Finally, an SVM classifier decides the label based on the combined kernel matrix $Kernel$.
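The convex combination of kernel matrices can be illustrated with the sketch below. It is not EasyMKL: the weights are fixed by hand rather than learned, and the RBF base kernel, the `gamma` value, and all names and data are assumptions made for illustration. Only the constraint that the coefficients are non-negative and sum to 1 is taken from the description above.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def rbf_kernel(X: Sequence[Sequence[float]], gamma: float = 1.0) -> Matrix:
    """Base kernel for one modality (RBF is an assumed choice)."""
    return [
        [math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(xi, xj))) for xj in X]
        for xi in X
    ]


def combine_kernels(kernels: Sequence[Matrix], mu: Sequence[float]) -> Matrix:
    """Convex sum of kernel matrices: sum_v mu_v * Kernel_v."""
    if len(kernels) != len(mu):
        raise ValueError("one coefficient per kernel matrix is required")
    if any(w < 0 for w in mu) or abs(sum(mu) - 1.0) > 1e-9:
        raise ValueError("coefficients must be non-negative and sum to 1")
    n = len(kernels[0])
    return [
        [sum(w * K[i][j] for w, K in zip(mu, kernels)) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical normalized data: two samples per modality.
acc = [[0.1, 0.2, 0.3], [0.2, 0.2, 0.3]]
gyro = [[0.5, 0.5, 0.5], [0.4, 0.6, 0.5]]
flex = [[0.9, 0.1, 0.1, 0.1, 0.9], [0.1, 0.9, 0.9, 0.9, 0.1]]
base_kernels = [rbf_kernel(acc), rbf_kernel(gyro), rbf_kernel(flex)]
mu = [0.2, 0.3, 0.5]  # hand-picked weights; an MKL solver would learn these
combined = combine_kernels(base_kernels, mu)
```

The combined matrix could then be given to a kernel classifier that accepts a precomputed kernel, for example an SVM.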

III. EXPERIMENTAL RESULTS
In this paper, two datasets (HandGlove1 and HandGlove2) were utilized. The "leave-one-subject-out cross-validation" protocol is followed [3]. That is, the samples from one subject are used for testing and the samples from the remaining subjects are used for training. The accuracy is then averaged over all experiments.
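The protocol above can be sketched as a simple index splitter. The function names and the toy subject list are hypothetical; training and scoring a classifier on each fold are left to the caller.

```python
from typing import Iterator, List, Sequence, Tuple


def leave_one_subject_out(
    subject_ids: Sequence[int],
) -> Iterator[Tuple[int, List[int], List[int]]]:
    """Yield (held_out_subject, train_indices, test_indices) for each subject."""
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test


def mean_accuracy(fold_accuracies: Sequence[float]) -> float:
    """Average the per-fold accuracies into a single figure."""
    if not fold_accuracies:
        raise ValueError("no fold accuracies given")
    return sum(fold_accuracies) / len(fold_accuracies)


# Toy example: five samples recorded by three subjects.
subject_ids = [1, 1, 2, 2, 3]
folds = list(leave_one_subject_out(subject_ids))
```

With the 15 subjects of the datasets, this procedure would produce 15 folds, each testing on one person the model has not seen during training.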

A. Single Modality for Hand Gesture Recognition
In this section, the HandGlove1 dataset is split into the following data types: 5 components from the finger curvature sensors F(1,2,3,4,5), 3 components that measure the angle of the hand A(x,y,z), and 3 components that give the angular velocity G(x,y,z). Each type is first normalized and then fed into the SVM classifier. The classification results for each modality are presented in Table I. Table I shows that the finger curvatures of the hand gesture dataset (Figure 3) obtain the highest recognition result (92.15%), while the A(x,y,z) cue has the lowest at only 36.48%. Although the result of the G(x,y,z) modality is higher than that of the A(x,y,z) cue, it is still ineffective for the defined dataset at 42.76%. These results show that no single data type is accurate enough for control in practice.

B. Hand Gesture Recognition on Various Fusion Strategies
In this part, we investigate different strategies (early fusion, late fusion, and MKL) on the three streams. Evaluations are conducted on the HandGlove1 dataset. The results are shown in Figure 5.
Fig. 5. Hand gesture recognition on various fusion strategies.
Fig. 6. t-SNE of 12 gestures with various modalities.
Figure 5 shows that:
• MKL obtains the highest accuracy, at 94.58%, 90.18%, 97.31%, and 99.87% with the combinations A and F, A and G, G and F, and A+G+F, respectively. The late fusion method reaches the lowest accuracy on all data combinations (around 75%, with a minimum of 46.17% for the A+G combination). Early fusion is quite simple, but its results are lower than those of MKL. The late fusion method requires many classifiers, yet its accuracy is far lower than that of MKL. Therefore, MKL is used in the remaining experiments.
• The combination of the 3 modalities gives the highest accuracy for all classifiers: 97.69%, 86.02%, and 99.87% for early fusion, late fusion, and MKL, respectively. These results can be explained through Figure 6, which shows the distribution of the data using the t-SNE method [26]. When G is used, the hand gesture classes fall in overlapping domains. In Figure 6(c), the distribution of the categories is improved, but the separation is still not clear enough. In Figure 6(d), A, G, and F are used together, and the distribution domains of the 12 hand gestures are completely separated.
• Combining the three modalities gives only slightly higher results on the HandGlove1 dataset because the gestures in this dataset have negligible displacement: the hand rotations are almost the same and the hand is raised in front of the body. However, when a model trained only on the HandGlove1 dataset is deployed in a real application, the online system makes many mistakes. The effectiveness of the A and G modalities is therefore evaluated in Section III.C.

C. Distinction between Command Action and Normal Gestures
In this section, the MKL fusion strategy is tested on the HandGlove2 dataset. All gestures belonging to the group $Ges_k^2$ are given labels from 13 to 24. The recognition accuracy is computed only for gestures with labels from 1 to 12, which are the gestures that are defined and expected to be correctly recognized by the system: the user raises his/her hand in front of the face, the hand is stationary, and only the hand shape changes according to the specified gesture. In this test, different combinations of modalities are used: A and F, G and F, and A, G, and F. The evaluation results are shown in Figure 7.
Fig. 7. Effect of modality combination on hand gesture recognition accuracy (%) in a real environment.
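One possible reading of this evaluation rule is sketched below: the classifier is trained on all 24 classes, but accuracy is computed only over test samples whose true label is a command gesture (1 to 12). This interpretation, the function name `command_accuracy`, and the example labels are assumptions.

```python
from typing import Sequence


def command_accuracy(y_true: Sequence[int], y_pred: Sequence[int],
                     n_commands: int = 12) -> float:
    """Accuracy restricted to samples whose true label is in 1..n_commands."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if 1 <= t <= n_commands]
    if not pairs:
        raise ValueError("no command gestures among the true labels")
    return sum(1 for t, p in pairs if t == p) / len(pairs)


# Hypothetical labels: a command gesture misread as a moving gesture
# (true 2, predicted 14) counts as an error.
y_true = [1, 2, 13, 3, 14]
y_pred = [1, 14, 13, 3, 2]
acc = command_accuracy(y_true, y_pred)
```

Under this reading, predicting a label from 13 to 24 for a stationary command gesture lowers the score, which is what makes the A and G modalities matter on HandGlove2.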
On the HandGlove1 dataset, the recognition results of the different data combinations do not differ greatly, ranging from 92.15% to 99.87%. The hand gestures are well distinguished from one another. In addition, the end-user's hand and body are immobile, so the curvature of the fingers provides enough cues about hand shape in this dataset and the recognition results are quite good. However, on the HandGlove2 dataset, the recognition accuracy is low when only the 5 flex sensors are used. This is because two gestures can share the same hand shape, i.e. the same five F elements, while one of them is performed during normal hand movement (different A and G elements). Gestures 1 to 12 have the same hand shapes as the corresponding gestures 13 to 24 (Figure 8). Therefore, the curvatures of the fingers alone do not give enough information to distinguish between a control operation and a normal operation, as shown in Figure 8. For example, gesture 1 (red color) and gesture 13 (brown color) in Figure 8 have the same hand shape, but gesture 1 is immobile while gesture 13 is in movement. The distributions of these gestures overlap in Figure 8(a) but are separated in Figure 8(b). Thus, there is a big gap between the single-modal result (56.72% with F) and the multi-modal result (97.59% with A, G, and F) in the second row of Figure 7.
On the HandGlove2 dataset, combining F with A or F with G gives a significant improvement in accuracy, with gains of up to 33% and 36%, respectively (Figure 7). In particular, the combination of the 3 components (A, G, and F) obtains the highest recognition accuracy, reaching 97.59% on this dataset. The proposed framework can clearly distinguish between the 12 hand gestures in both immobile and mobile states. The efficiency of the system increases considerably, which makes deployment in practical applications feasible.

IV. CONCLUSION AND DISCUSSION
This paper presents a comparative analysis of fusion strategies for static hand gesture recognition on three modalities of an electronic hand glove. Among the evaluated fusion strategies (late fusion, early fusion, and MKL), MKL performed best, achieving the highest recognition accuracy. The evaluation results on two datasets show the benefit of combining multiple modalities (5 finger curvatures from the flex sensors, and the 3-axis gyroscope and 3-axis accelerometer of the MPU6050 sensor) to increase accuracy. In most cases, multi-modal MKL models achieve an accuracy above 97%. This performance is promising for deploying gesture-based applications in practice. Finally, a notable gap in recognition accuracy was found between identical hand shapes performed with and without normal hand movement. This last remark opens up new research directions that require further investigation, including combining the sensor data with image information, using deep convolutional neural networks, and online learning. Once these bottlenecks are resolved, the development of a gesture-based interface for practical applications should be straightforward.

ACKNOWLEDGEMENT
This research was funded by the Electric Power University (EPU) under the grant project titled "Research on multi-modal and multi-view hand gesture recognition combinative utilizing sensors and images".