A Ternary Neural Network with Compressed Quantized Weight Matrix for Low Power Embedded Systems

Abstract—In this paper, we propose a method for transforming a real-valued matrix into a ternary matrix with controllable sparsity. The sparsity of the quantized weight matrices can be controlled by adjusting the threshold during the training and quantization process. A 3-layer ternary neural network was trained on the MNIST dataset using the proposed adjustable dynamic threshold. As the sparsity of the quantized weight matrices varied from 0.1 to 0.6, the obtained recognition rate decreased from 91% to 88%. The sparse weight matrices were compressed in the compressed sparse row format to speed up the ternary neural network, which can be deployed on low-power embedded systems such as the Raspberry Pi 3 board. With a quantized-weight-matrix sparsity of 0.1, the ternary neural network is 4.24 times faster than the ternary neural network without compressed weight matrices, and it becomes faster as the sparsity increases. When the sparsity of the quantized weight matrices is as high as 0.6, the recognition rate degrades by 3%; however, the network runs 9.35 times faster than the ternary neural network without compressed quantized weight matrices. Ternary neural networks with compressed sparse weight matrices are therefore feasible for low-cost, low-power embedded systems.


I. INTRODUCTION
Deep Neural Networks (DNNs) have achieved impressive success in the field of computer vision [1][2][3][4]. Modelling the human brain using DNNs requires a massive number of computation tasks, including additions and multiplications. Therefore, it is often challenging to implement DNNs on low-power edge devices such as mobile embedded systems [5]. Edge computing has attracted much attention recently because it offers advantages in terms of cost and security. To run DNNs on low-power edge devices, many optimized DNN architectures have been proposed. To increase accuracy, DNNs can be trained on a GPU and the trained models are then loaded onto low-cost embedded systems, such as the Raspberry Pi board [6][7][8]. Another method is to add an external accelerator, the Neural Compute Stick (NCS), to the Raspberry Pi when deploying DNNs on it [9]. These deployments of DNNs on the low-cost Raspberry Pi board are based on full-precision weights, which consume considerable power and processing time, and the memory usage and inference speed of such models have not been considered. An alternative technique to enhance the performance of DNNs deployed on low-cost computers is to quantize the parameters to speed up run-time and reduce memory consumption [10][11][12][13][14][15][16][17]. Traditionally, 32-bit floating-point is used as the numerical format in DNNs, which has a large impact on speed and memory usage. Reducing the number of bits representing DNN parameters is therefore attractive for low-power edge devices; in particular, using numerical formats with lower precision than 32-bit floating point yields numerous benefits. 16-bit and 8-bit floating-point formats are commonly used for lightweight DNNs without sacrificing accuracy [5]. Substantial research effort has been invested in even lower-precision representations of the parameters (synaptic weights), such as ternary and binary formats, to make the implementation of DNNs on low-power edge devices possible [10][11][12][13][14][15][16][17].
A binary neural network constrains the synaptic weights to the binary space {-1, 1}. In a binary neural network, the conventional 32-bit floating-point multipliers are replaced by logical XNOR operations to speed up the running time and reduce memory consumption. However, the accuracy of binary neural networks is lower than that of full-precision neural networks because only one bit is used to represent the synaptic weights and the activations. To increase the accuracy, ternary neural networks that constrain the synaptic weights to the ternary space {-1, 0, 1} have been proposed [14][15][16]. When training a ternary neural network, the weights are updated using real-valued variables and are then constrained to -1, 0, or +1 using a ternarization function [14][15][16]. Ternarization with a dynamic threshold yields faster convergence in the training phase and higher accuracy in the inference phase [18]. However, the dynamic threshold, which is based on the mean and standard deviation of the real-valued variables, produces an unpredictable number of -1, 0, and +1 values in the synaptic weight matrices. In this work, we adjust the threshold during the quantization process to obtain sparse weight matrices with controllable sparsity.

II. TERNARY NEURAL NETWORK WITH COMPRESSED QUANTIZED WEIGHT MATRICES
Figure 1 shows the concept of a ternary neural network in which the weights are constrained to -1, 0, and 1 [18][19]. x_1 to x_n are the binary inputs, and h_1 to h_m are the neuron outputs of the hidden layer, which are also quantized to binary. y_1 to y_k are the neuron outputs for the k classes. In Figure 1, Wh is the input-to-hidden layer weight matrix and Wo is the hidden-to-output layer weight matrix. Here, the weight matrices are composed of -1, 0, and 1, representing the inhibitory, contactless, and excitatory synapses.

Figure 1. Conceptual diagram of a ternary neural network, where the synaptic weights are -1, 0, or +1, representing inhibitory, contactless, and excitatory synapses.
A ternary neural network represents weights using fewer bits than a full-precision neural network. The ternary weights can be represented by low-bit signed integer values or by complementary binary arrays [19]. The memory required for the model's parameters in a ternary neural network is therefore substantially less than that of a full-precision neural network. The ternary neural network is trained using the traditional gradient descent method, which updates the weights in the direction of the maximum decrease of the loss function. The weights are updated with real values and transformed to ternary values using the following quantization function [18]:

w_t = \begin{cases} +1, & w_r > w_{threshold} \\ 0, & -w_{threshold} \le w_r \le w_{threshold} \\ -1, & w_r < -w_{threshold} \end{cases}   (1)

where w_{threshold} is the threshold weight, w_r is the real-valued weight, and w_t is the ternary weight of -1, 0, or +1. By using (1), the ternary weights are obtained by comparing the real-valued weights with a positive threshold value. It can be observed that the distribution of the synaptic weights is different at every training iteration. Therefore, a dynamic threshold based on the Gaussian distribution is selected, as proposed in our previous work [18]. The proposed method attempts to equalize the numbers of negative, zero, and positive weights. The quantization function with the dynamic threshold is presented in (2) [18]:

w_t = \begin{cases} +1, & w_r > \mu + 0.44\sigma \\ 0, & \mu - 0.44\sigma \le w_r \le \mu + 0.44\sigma \\ -1, & w_r < \mu - 0.44\sigma \end{cases}   (2)

where µ and σ are respectively the mean and standard deviation of the real-valued synaptic weights. According to the Gaussian distribution, if the thresholds are selected to be µ-0.44σ and µ+0.44σ, we obtain 33%, 34%, and 33% as the proportions of negative, zero-valued, and positive synaptic weights, respectively [18]. These percentages are maintained at every epoch of the training process because the threshold adapts to the distribution of the synaptic weights.
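As a rough illustration of the dynamic-threshold quantization in (2), the NumPy sketch below ternarizes a real-valued weight matrix. The function name and the use of NumPy are our own illustrative choices, not the authors' implementation.

```python
import numpy as np

def ternarize_dynamic(w_real):
    """Quantize real-valued weights to {-1, 0, +1} using the dynamic
    threshold mu +/- 0.44*sigma described in (2)."""
    mu = w_real.mean()
    sigma = w_real.std()
    w_t = np.zeros_like(w_real, dtype=np.int8)
    w_t[w_real > mu + 0.44 * sigma] = 1
    w_t[w_real < mu - 0.44 * sigma] = -1
    return w_t

# Example: weights drawn from a Gaussian give roughly 33% / 34% / 33%
# negative, zero, and positive ternary weights.
w = np.random.randn(512, 784)
print(np.mean(ternarize_dynamic(w) == 0))  # close to 0.34
```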
Increasing the number of zero values in the quantized weight matrices leads to higher sparsity. A sparse matrix can be compressed to reduce memory consumption and matrix multiplication time. In this work, we control the percentage of zeros by modifying the quantization function as follows:

w_t = \begin{cases} +1, & w_r > \mu + \lambda\sigma \\ 0, & \mu - \lambda\sigma \le w_r \le \mu + \lambda\sigma \\ -1, & w_r < \mu - \lambda\sigma \end{cases}   (3)

where λ is a variable that controls the threshold. In (3), increasing λ increases the number of zeros: the higher the value of λ, the higher the sparsity of the quantized weight matrices. The sparse weight matrices can be compressed to reduce memory usage and speed up the forward pass. They are compressed using the Compressed Sparse Row (CSR) format, which potentially leads to a substantial decrease in computational time and speeds up the neural network [20][21][22][23]. Figure 2 shows an example of the CSR format for a sparse matrix: Figure 2(a) shows a sparse matrix and Figure 2(b) shows its CSR representation. CSR is a popular, general-purpose sparse matrix representation. The matrix is stored using three arrays: the row pointer array, the column indices array, and the data values array [23]. The row pointer array stores pointers to the beginning of every row, the column indices array stores the corresponding column indices, and the data values array stores the non-zero values, as illustrated in Figure 2. The row pointer array begins with the value 0. For the first row, the 1st, 2nd, and 3rd columns of the sparse matrix hold the values 1, 1, and -1, which appear in the column indices and data values arrays in Figure 2(b). The second row of the sparse matrix is delimited by the second element of the row pointer array, which has the value 3. The value -1 in the second row is represented by a column index of 1 and a data value of -1, as shown in Figure 2(b). The sparse matrix in Figure 2(a) can thus be reconstructed from the arrays in Figure 2(b). By doing this, the memory and time consumption for matrix multiplication are significantly reduced.
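As a sketch of how a quantized weight matrix is stored in the three CSR arrays, the snippet below builds a small hypothetical ternary matrix (not the exact matrix of Figure 2) and compresses it with SciPy's csr_matrix.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical small ternary matrix; the zeros are the source of sparsity.
W = np.array([[ 1,  1, -1,  0],
              [ 0, -1,  0,  0],
              [ 0,  0,  0,  1]], dtype=np.int8)

W_csr = csr_matrix(W)
print(W_csr.indptr)   # row pointer array:  [0 3 4 5]
print(W_csr.indices)  # column indices:     [0 1 2 1 3]
print(W_csr.data)     # non-zero values:    [ 1  1 -1 -1  1]
```

Only the non-zero entries and their positions are stored, so the memory footprint shrinks as λ, and hence the sparsity, grows.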
During forward-pass propagation, the neuron outputs are calculated using matrix multiplication. Assume that x = [x_1, x_2, ..., x_n] is the 1×n input vector, Wh is the m×n input-to-hidden layer weight matrix, and h = [h_1, h_2, ..., h_m] is the output vector of the hidden layer. The forward-pass propagation performs the following computation:

h^T = W_h \, x^T   (4)

Equation (4) consumes more power and time as the size of the input vector and the weight matrix increases. If the weight matrix is sparse and represented in CSR, the matrix multiplication in (4) can be replaced with a sparse matrix-vector multiplication, which is faster than traditional dense matrix multiplication [20]. In this work, we represent the weight matrices as sparse matrices and compress them with CSR. The sparse matrix multiplication is performed using the Scientific Python (SciPy) library to save computational time [24].
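A minimal sketch of the forward pass in (4) with a CSR-compressed weight matrix is given below, assuming the 784-input, 512-hidden-unit layout used later in the experiments. The random weights and the simple sign-style binarization of the hidden outputs are illustrative assumptions, not the authors' exact activation.

```python
import numpy as np
from scipy.sparse import csr_matrix

n, m = 784, 512                      # input size and hidden layer size
rng = np.random.default_rng(0)

x = rng.integers(0, 2, size=n).astype(np.float32)   # binary input vector
W_h = rng.choice([-1, 0, 1], size=(m, n), p=[0.2, 0.6, 0.2]).astype(np.int8)
W_h_csr = csr_matrix(W_h)            # compressed input-to-hidden weights

pre_act = W_h_csr.dot(x)             # sparse matrix-vector product, Eq. (4)
h = (pre_act >= 0).astype(np.int8)   # binarize hidden outputs (assumed activation)
```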

III. EXPERIMENTAL RESULTS
A three-layer ternary neural network was deployed on a low-power Raspberry Pi board for the application of image recognition. The network was trained and tested on the MNIST dataset of handwritten digit images [25]. The input layer has 784 units corresponding to the 784 image pixels, and the inputs are binary. The hidden layer has 512 neurons and the output layer has 10 neurons for recognizing the 10 digits. The network is trained using Stochastic Gradient Descent with momentum. The real-valued weights are transformed to ternary weights using the proposed adjustable dynamic threshold.

By adjusting the variable λ in (3), we obtained recognition rates for different sparsity levels of the quantized weight matrices, as presented in Figure 3. In Figure 3, the sparsity of the quantized weight matrices is varied from 0.1 to 0.6 by adjusting λ, as explained above. The recognition rate degraded slightly as the sparsity of the quantized weight matrices increased. When the sparsity was as small as 0.1, the ternary neural network produced a recognition rate of 91%. When the sparsity increased to 0.6, the recognition rate was reduced by 3%. The results indicate that increasing the sparsity of the quantized weight matrices leads to a small decrease in accuracy. In this work, the training is performed on the edge device, a Raspberry Pi board, and the network simply consists of an input layer, a hidden layer, and an output layer with the weights quantized to -1, 0, and +1. The ternary weights are obtained by the proposed dynamic threshold quantization with controllable output sparsity. The accuracy of the ternary neural network is slightly lower than that of the full-precision neural network; however, it has the advantages of lower memory usage and faster inference. More importantly, the proposed ternary neural network is promising for low-cost edge devices. The sparse quantized weight matrices were represented using the CSR format, which significantly speeds up inference. For a fixed-size quantized weight matrix, higher sparsity results in a smaller CSR representation and faster CSR matrix multiplication.

We measured the forward-pass propagation time required to propagate one image from the input layer to the output layer, which is also the time for predicting one image. For the uncompressed quantized weight matrices, the forward-pass propagation takes 87.28 ms, and this inference time does not depend on the sparsity of the quantized weight matrices. Figure 4 shows the inference time for varied sparsity of the quantized weight matrices when the matrices were compressed with CSR. The matrix multiplication is performed using the SciPy library on the input vector and the CSR arrays. For a sparsity of 0.1, the forward-pass propagation takes 20.606 ms, which is 4.24 times faster than the ternary neural network with uncompressed quantized weight matrices. More interestingly, as the sparsity increases, the arrays of the CSR representation become smaller, resulting in faster multiplication. For a sparsity of 0.6, the inference time of the compressed-weight-matrix ternary neural network is 9.335 ms, which is 9.35 times faster than the original ternary neural network.
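For context, the snippet below is a rough timing sketch of the kind of comparison reported above: it times a dense ternary matrix-vector product against its CSR counterpart. This is not the authors' benchmark code, and the measured ratio depends on the sparsity, the library versions, and the hardware (a Raspberry Pi 3 versus a desktop CPU, for example).

```python
import time
import numpy as np
from scipy.sparse import csr_matrix

n, m = 784, 512
rng = np.random.default_rng(1)
x = rng.integers(0, 2, size=n).astype(np.float32)
# Ternary weights with roughly 90% zeros (a high-sparsity case).
W = rng.choice([-1.0, 0.0, 1.0], size=(m, n), p=[0.05, 0.9, 0.05]).astype(np.float32)
W_csr = csr_matrix(W)

def avg_time(fn, repeats=1000):
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats

dense_ms = avg_time(lambda: W @ x) * 1e3
csr_ms = avg_time(lambda: W_csr.dot(x)) * 1e3
print(f"dense: {dense_ms:.3f} ms per product, CSR: {csr_ms:.3f} ms per product")
```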
Quantizing the weight matrix is one of the techniques suitable for deploying DNNs on low-cost computers. Quantized neural networks reduce the memory required for storing the model's parameters and internal variables, and run faster than full-precision neural networks for speech and image recognition, as presented in [19]. In this work, we propose a method to control the sparsity of the quantized weight matrices during the training process and to compress the weight matrices using the CSR representation. High sparsity of the quantized weight matrices sacrifices little accuracy but speeds up the ternary neural network by a factor of up to 9.35. The proposed idea is deployed on a simple 3-layer neural network for handwritten digit recognition. Utilizing highly sparse quantized weight matrices and CSR makes it possible to implement the ternary neural network on low-cost, low-power embedded systems such as the general-purpose Raspberry Pi 3 board.
IV. CONCLUSION
In this paper, we proposed a quantization function that can control the sparsity of quantized weight matrices for ternary neural networks. The sparsity of the quantized weight matrices varied from 0.1 to 0.6 when the ternary neural network was trained on the MNIST dataset, and the obtained recognition rate decreased from 91% to 88%. The sparse weight matrices were compressed using the CSR format. The ternary neural network with compressed weight matrices was 4.24 times and 9.35 times faster than the original ternary neural network when the sparsity of the quantized weight matrices was 0.1 and 0.6, respectively. Ternary neural networks with compressed quantized weight matrices are suitable for implementation on low-power embedded systems for the application of image recognition.