Shallow SqueezeNext: Real Time Deployment on Bluebox2.0 with 272KB Model Size

The main challenges in deploying CNNs/DNNs on ADAS are the limited computation and memory resources of the target platforms. Design space exploration (DSE) of CNNs/DNNs, training and testing DNNs from scratch, hyperparameter tuning, and experiments with different optimizers contributed to the efficiency and performance improvements of the Shallow SqueezeNext architecture. The architecture is computationally efficient and inexpensive, and it requires minimal memory resources. Trained and tested from scratch on CIFAR-10 and CIFAR-100, it achieves a smaller model size and better speed than counterparts such as AlexNet, VGGNet, SqueezeNet, and SqueezeNext. Its smallest model, tested on the CIFAR-10 dataset, is 272KB with a model accuracy of 82% and a model speed of 9 seconds per epoch. Across the experiments, the best accuracy was 91.41%, the best model size was 0.272 MB, and the best model speed was 4 seconds per epoch. Memory is of high importance for real-time systems and platforms because it is usually quite limited. To verify that Shallow SqueezeNext can be successfully deployed on a real-time platform, Bluebox2.0 by NXP was used. The Bluebox2.0 deployment of the Shallow SqueezeNext architecture achieved a model accuracy of 90.50% with an 8.72MB model size and a model speed of 22 seconds per epoch. Another, better-performing version of Shallow SqueezeNext, trained and tested from scratch on the CIFAR-10 dataset, attained a model size of 0.5MB with a model accuracy of 87.30% and a model speed of 11 seconds per epoch.


Introduction
DNN model performance is critical to the safety of ADAS and UAV applications. DNNs overcame the limitations of their traditional counterparts, which are more memory-intensive and computationally expensive algorithms. Here, DNN model performance refers to model accuracy, model memory size, and model speed. Owing to more intensive design space exploration (DSE) of CNNs in small and macro CNN architectures, especially for ADAS and real-time embedded systems [10], new baseline architectures such as SqueezeNet [1] and SqueezeNext [2] were introduced that are more efficient than the traditional architectures [8,10,15]. DNNs are usually trained and tested on widely available datasets such as MNIST, CIFAR-10, COCO, and ImageNet. DNNs usually comprise four elemental layer types: activation, convolution, pooling, and fully connected layers. The convolution operation is performed with stride and padding values; the default stride is 1, and zero-padding is used to maintain the spatial dimensions of the feature maps. Different optimizers [7] and learning rate scheduling methods are implemented. To improve DNN architectures, we perform DSE of DNNs, architecture modification, hyperparameter tuning, and tweaking [4,5,11,15,16,17,18]. In this paper, the proposed architecture is trained and tested on the CIFAR-10 and CIFAR-100 datasets [9], initially on a GPU, and is later deployed on Bluebox2.0 [11,16], a real-time embedded platform by NXP. This research carries out DSE of DNNs for the Shallow SqueezeNext architecture with the help of insights from [16][17][18]. Finally, we deploy Shallow SqueezeNext on Bluebox2.0 by NXP with the help of RTMaps [19].
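As a small illustration of the stride and padding remark above, the spatial output size of a convolution follows the standard relation out = floor((in + 2p - k) / s) + 1. The sketch below is not taken from the paper's code; the function name conv_output_size and the 32-pixel input side (the CIFAR image size) are chosen here purely for illustration.

```python
def conv_output_size(in_size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial output size of a convolution along one dimension."""
    return (in_size + 2 * padding - kernel) // stride + 1


# A 3x3 convolution with stride 1 and zero-padding of 1 preserves a 32-pixel side.
same_size = conv_output_size(32, kernel=3, stride=1, padding=1)

# Without padding, the same convolution shrinks the side by kernel - 1 pixels.
no_padding = conv_output_size(32, kernel=3, stride=1, padding=0)

# A stride of 2 halves the side for this input.
strided = conv_output_size(32, kernel=3, stride=2, padding=1)
```

With stride 1, a padding of (k - 1) / 2 keeps the feature map size unchanged for odd kernel sizes, which is the case the text describes.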

SqueezeNet
The SqueezeNet architecture [1] comprises convolution layers, max pooling layers, fire modules, ReLU and ReLU in-place activations, and softmax activation layers. The fire module is the mainstay of the SqueezeNet architecture. It consists of two kinds of layers: one squeeze layer, s2 (1x1), and two expand layers, e1 (1x1) and e3 (3x3). These layers are responsible for the reduction in model size (parameter count) and for the better model speed. Three key design strategies were used to develop this architecture: 1) Replacing 3x3 convolution layers with 1x1 convolution layers. 2) Reducing the number of input channels to the 3x3 convolution layers. 3) Down-sampling (max pooling) late in the network.
Compared with the VGG architecture, SqueezeNet performs better in terms of model size and speed. It reduces the model size from the 385MB of VGG down to 0.5MB, with an accuracy tradeoff. Additionally, the large decrease in parameter count leads to a better model speed (processing time per epoch) for the SqueezeNet model.
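To make the parameter saving of the fire module concrete, the following sketch counts the learnable parameters of one fire module and of a plain 3x3 convolution with the same input and output channel counts. The channel sizes (96 inputs, 16 squeeze filters, 64 + 64 expand filters) and the helper names are illustrative assumptions, not values reported in this paper.

```python
def conv_params(in_ch: int, out_ch: int, k_h: int, k_w: int, bias: bool = True) -> int:
    """Number of learnable parameters in a 2D convolution layer."""
    return in_ch * out_ch * k_h * k_w + (out_ch if bias else 0)


def fire_params(in_ch: int, squeeze: int, expand1: int, expand3: int) -> int:
    """Parameters of a fire module: a 1x1 squeeze, then parallel 1x1 and 3x3 expands."""
    return (
        conv_params(in_ch, squeeze, 1, 1)        # squeeze layer (1x1)
        + conv_params(squeeze, expand1, 1, 1)    # expand layer (1x1)
        + conv_params(squeeze, expand3, 3, 3)    # expand layer (3x3)
    )


# Illustrative sizes (assumed): 96 input channels, 128 output channels in total.
fire = fire_params(96, squeeze=16, expand1=64, expand3=64)
plain = conv_params(96, 128, 3, 3)
reduction = plain / fire  # how many times fewer parameters the fire module needs
```

Under these assumed sizes the fire module needs roughly nine times fewer parameters than the plain 3x3 layer, because the expensive 3x3 filters only see the 16 squeezed channels.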

SqueezeNext
The baseline SqueezeNext architecture [2] is built on the following key factors: 1) Better channel reduction through a two-stage squeeze module, which significantly reduces the parameters used by the 3x3 convolutions. 2) Separable 3x3 convolutions for model size reduction, and removal of the 1x1 convolution after the squeeze module. 3) An element-wise addition skip connection similar to that of ResNet. The baseline SqueezeNext configuration for CIFAR-10 is [6,6,8,1]. Figure 1 illustrates a modified version of the baseline SqueezeNext architecture, implemented in the PyTorch framework and trained and tested from scratch on the CIFAR-10 and CIFAR-100 datasets. The baseline SqueezeNext is formed by a four-stage implementation of bottleneck modules, skip connections, ReLU and ReLU (in-place) layers, batch normalization, a spatial resolution layer, max pooling layers, and a fully connected layer. Within the baseline SqueezeNext, the bottleneck modules are chiefly responsible for the large reduction in parameters [14][15][16]. The architecture begins with the first convolution (the white block; grey block in Figure 1), which takes in a 3-channel feature map. The output of this first convolution becomes the input to the subsequent four-stage configuration of the baseline SqueezeNext. The sequence of differently colored blocks (dark blue, blue, orange, and yellow) that follows the first convolution in Figure 1 illustrates the four-stage configuration belonging to Shallow SqueezeNext, which depicts low-level, medium-level, and high-level features, respectively. The blocks also depict a change in the resolution of the input feature map in the baseline SqueezeNext architecture.
The key observation is that the small number of initial blocks, which capture low-level features, hold redundant information, in contrast to the mid-level and high-level features later in the CNN, which carry most of the useful feature map information.
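The effect of the two-stage squeeze and the separable convolutions on the parameter count can be sketched with a simple weight count. The channel ratios used below (C to C/2 to C/4, then 3x1 and 1x3 convolutions, then a 1x1 expansion back to C) follow a common reading of the SqueezeNext block in [2] and are an assumption here; the implementation used in this paper may differ. Biases and batch normalization parameters are ignored.

```python
def conv_weights(in_ch: int, out_ch: int, k_h: int, k_w: int) -> int:
    """Weight count of a 2D convolution (bias ignored)."""
    return in_ch * out_ch * k_h * k_w


def bottleneck_weights(channels: int) -> int:
    """Weights of an assumed SqueezeNext-style bottleneck with `channels` in and out."""
    c = channels
    layers = [
        (c, c // 2, 1, 1),        # first squeeze stage (1x1)
        (c // 2, c // 4, 1, 1),   # second squeeze stage (1x1)
        (c // 4, c // 2, 3, 1),   # separable convolution, 3x1 part
        (c // 2, c // 2, 1, 3),   # separable convolution, 1x3 part
        (c // 2, c, 1, 1),        # 1x1 expansion back to the input width
    ]
    return sum(conv_weights(*layer) for layer in layers)


channels = 64  # illustrative width (assumed)
bottleneck = bottleneck_weights(channels)
direct_3x3 = conv_weights(channels, channels, 3, 3)
saving = direct_3x3 / bottleneck
```

For this assumed width, the bottleneck uses a quarter of the weights of a single full-width 3x3 convolution while still covering a 3x3 receptive field.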

Modified SqueezeNext
The Modified SqueezeNext architecture was developed for this research to allow an unbiased comparison, within the PyTorch framework, between the proposed Shallow SqueezeNext architecture and a SqueezeNext implementation. It also provided insights into possible areas of improvement within the baseline architecture and helped to explore the baseline SqueezeNext further. The Modified SqueezeNext architecture is built from the basic block illustrated in Figure 2 (right); these blocks are arranged into the two block structures shown in Figure 3. In Figure 3, both block structures (left and right) begin with the input data being fed into the Modified SqueezeNext basic block (Figure 2, right), whose output is fed to the max pooling layer. The block structure on the right depicts the first block of each stage of the four-stage configuration, that is, the first dark blue, blue, and orange blocks and the last yellow block.
The block structure on the left forms each of the remaining blocks of the four-stage configuration of the Modified SqueezeNext. For a fair and unbiased comparison with the proposed architecture, all architectures are trained and tested in PyTorch only, on the CIFAR-10 and CIFAR-100 datasets.

Architecture Tuning
A recently introduced optimizer and several other activation functions [4] were used in experiments on the proposed Shallow SqueezeNext architecture to further fine-tune and tweak it.

Adabound
Adabound [12] is a recently introduced optimizer that applies dynamic bounds to the learning rate and thereby achieves a gradual transition from an adaptive method to SGD. It shows good results while retaining the benefits of adaptive methods. Its lower and upper bounds adjust after the CNN/DNN has run for several epochs (for the proposed architecture, between 60 and 90 epochs), so that the optimizer transforms from Adam to SGD. Its default hyperparameters are a learning rate of 0.001, beta1 = 0.9, and beta2 = 0.999. Optimizers such as Adagrad, Adam, and RMSprop were seen to perform better early in training, whereas SGD begins to outperform them once the learning rates are decayed. Adabound, in contrast, converges fast and achieves a slightly higher accuracy than SGD.
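The dynamic bounds described above can be sketched in a few lines. The bound formulas and the values final_lr = 0.1 and gamma = 1e-3 below follow the Adabound authors' reference implementation as commonly distributed and should be treated as assumptions to verify against [12]; the function names are hypothetical, and this is not a full optimizer.

```python
def adabound_bounds(step: int, final_lr: float = 0.1, gamma: float = 1e-3):
    """Lower and upper bound on the per-parameter step size at a given step (step >= 1)."""
    lower = final_lr * (1.0 - 1.0 / (gamma * step + 1.0))
    upper = final_lr * (1.0 + 1.0 / (gamma * step))
    return lower, upper


def clip_step_size(adaptive_step: float, step: int,
                   final_lr: float = 0.1, gamma: float = 1e-3) -> float:
    """Clip an Adam-style step size into the current bounds."""
    lower, upper = adabound_bounds(step, final_lr, gamma)
    return min(max(adaptive_step, lower), upper)


# Early in training the bounds are loose, so the optimizer behaves like Adam.
early_lower, early_upper = adabound_bounds(1)

# Late in training both bounds close in on final_lr, so every parameter
# receives nearly the same step size, as in SGD.
late_lower, late_upper = adabound_bounds(1_000_000)
```

Because the interval shrinks smoothly around final_lr, the change from adaptive behaviour to SGD-like behaviour is gradual rather than an abrupt switch of optimizers.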

Rectified Linear Units (ReLU) in Place
ReLU in-place is a non-linear activation layer that provides the same advantages as ReLU with, additionally, better performance. It modifies the input directly without allocating an additional output, and it is observed to save some memory in comparison to ReLU. It cannot be used everywhere, as it needs a valid operation or use case.

Exponential Linear Units (ELU) in Place
ELU is an activation function that drives the cost toward zero faster and thereby produces more accurate results. Its curve smooths out slowly rather than changing sharply. It also has a special in-place variant, ELU (in-place). All in-place operations are observed to save memory because they do not allocate additional outputs, which is a significant benefit for a CNN/DNN model.
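The difference between an out-of-place and an in-place activation can be illustrated without any framework. The sketch below uses the standard ELU definition, x for x > 0 and alpha * (exp(x) - 1) otherwise, on a plain Python list; the helper names are hypothetical and the list stands in for a tensor buffer.

```python
import math


def elu(x: float, alpha: float = 1.0) -> float:
    """Standard ELU: identity for positive inputs, smooth saturation toward -alpha otherwise."""
    return x if x > 0 else alpha * math.expm1(x)


def elu_out_of_place(values, alpha: float = 1.0):
    """Allocate and return a new list; the input is left untouched."""
    return [elu(v, alpha) for v in values]


def elu_in_place(values, alpha: float = 1.0):
    """Overwrite the input buffer element by element; no second buffer is allocated."""
    for i, v in enumerate(values):
        values[i] = elu(v, alpha)
    return values


activations = [-2.0, -0.5, 0.0, 1.5]
copy_result = elu_out_of_place(activations)   # new buffer
same_buffer = elu_in_place(activations)       # same buffer, original values are gone
```

In PyTorch the same choice is made with the inplace=True argument of the activation modules. The in-place form discards the original input values, which is why it is only valid where those values are not needed later.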

BlueBox2.0 by NXP
Bluebox2.0 [19] is the second version of NXP's real-time deployment platform for autonomous driving applications. It provides automotive reliability, functional safety, and the freedom to implement algorithms in frameworks such as PyTorch, TensorFlow, and Keras. The recent edition of Bluebox2.0 incorporates three essential components: the S32V234 (vision processor), the LS2084A (embedded compute processor), and the S32R27 (radar). It is operated with a Linux BSP image on a 16GB microSD card. To deploy CNNs/DNNs such as the proposed Shallow SqueezeNext architecture, it makes use of the RTMaps framework [19], a tool used alongside Bluebox2.0 for architecture deployment.
RTMaps (Real-Time Multisensor applications) is an easy-to-use, efficient, and robust framework for real-time embedded systems. It is designed for developing, testing, benchmarking, validating, and executing multimodal applications. It consists of four key modules: RTMaps Runtime Engine, RTMaps Component Library, RTMaps Studio, and RTMaps Embedded. The computer running RTMaps and the remote RTMaps studio on Bluebox2.0 are connected via a static TCP/IP connection.
Architecture Deployment: To deploy PyTorch code on Bluebox2.0 with the help of RTMaps, the code must define three key functions: birth(), core(), and death() [16,19]. The PyTorch deployment of the Shallow SqueezeNext architecture on Bluebox2.0 with RTMaps is shown in Figure 4. The TCP/IP connection between RTMaps Studio on a PC, which connects remotely to the embedded platform, and the real-time platform Bluebox2.0 by NXP, running an Ubuntu BSP image, is illustrated in Figure 5.
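The birth(), core(), and death() structure mentioned above can be pictured as a simple component lifecycle. The class below is a hypothetical stand-in written in plain Python; it does not import or reproduce the RTMaps API, and the class name, the run_component driver, and the toy classifier are all assumptions made for illustration.

```python
class ShallowSqueezeNextComponent:
    """Hypothetical stand-in for a deployment component (not the RTMaps API)."""

    def __init__(self, classify):
        self.classify = classify  # any callable mapping an input sample to a label
        self.ready = False
        self.results = []

    def birth(self):
        """Runs once at start-up: the place to load the checkpoint and allocate resources."""
        self.ready = True

    def core(self, sample):
        """Runs for every incoming sample: perform inference and record the result."""
        if not self.ready:
            raise RuntimeError("birth() must be called before core()")
        label = self.classify(sample)
        self.results.append(label)
        return label

    def death(self):
        """Runs once at shutdown: the place to release resources."""
        self.ready = False


def run_component(component, samples):
    """Drive the lifecycle: birth once, core per sample, death once (even on errors)."""
    component.birth()
    try:
        for sample in samples:
            component.core(sample)
    finally:
        component.death()
    return component.results


# Toy classifier (assumption): label a sample by the sign of its sum.
def toy_classifier(sample):
    return int(sum(sample) > 0)


component = ShallowSqueezeNextComponent(toy_classifier)
labels = run_component(component, [[1, 2], [-3, 1], [0, 5]])
```

In an actual deployment the framework, rather than a driver function like run_component, would call the three functions, and core() would run the trained network on the incoming data.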

Shallow SqueezeNext
Shallow SqueezeNext is a shallow (that is, not too deep, or small) and compact DNN architecture. It was motivated by the SqueezeNext [2], SqueezeNet [1], and MobileNet [3] architectures. During the research, another architecture, High Performance SqueezeNext [17], was developed; it is essentially a deeper, more comprehensive version that offers better accuracy at the cost of model size. Shallow SqueezeNext is made up of bottleneck modules [2], which in turn consist of the basic blocks shown in Figure 6. These basic blocks are arranged in a four-stage configuration (Figure 7) followed by a spatial resolution layer, a dropout layer with probability p = 0.3, average pooling, and a fully connected layer. The architecture is based on the following strategies: 1) Managing depth and width scaling with resolution and width multipliers. 2) Using only in-place operations in all layers except those with a gradient change operation, with careful placement between the ELU in-place and batch normalization layers (Figure 6). 3) Incorporating an element-wise addition skip connection to avoid the vanishing gradient problem. 4) Adding a dropout layer at the end of the four-stage configuration, after the average pooling layer. 5) Reducing the number of max pooling layers and replacing them with average pooling layers; as observed in Figure 7, an average pooling layer follows the dropout layer. The architecture is also trained and tested with different optimizers. Figure 6 represents the basic block, the fundamental building block of the architecture, with the following layers: convolution (1x1), ELU (in-place) [13], and batch normalization. Shallow SqueezeNext basic blocks together form the bottleneck modules illustrated in Figure 8 (left), and these bottleneck modules are arranged in a four-stage configuration as shown in Figure 8 (right).
The basic blocks in Figure 6. and bottleneck modules four-stage configuration (

Resolution Multiplier
The resolution multiplier [3] is the first hyperparameter used to reduce the computational resource usage of a CNN/DNN. It is an important parameter that has a significant effect on parameter reduction and, consequently, on the scaled size of the model. It is responsible for the reduced size and parameter count of the Shallow SqueezeNext architecture.

Width Multiplier
The width multiplier [3] is the second hyperparameter, used to develop small, compact DNN models that are less expensive in terms of computation and memory usage. It thins the network uniformly at each layer, which reduces the computational cost and the number of parameters roughly by the square of the width multiplier.
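The scaling behaviour of the two multipliers can be checked with a simple cost model for a single standard convolution layer. The sketch below assumes the MobileNet-style definitions from [3], in which the width multiplier alpha scales the input and output channel counts and the resolution multiplier rho scales the feature map side; how the multipliers are applied inside Shallow SqueezeNext may differ, and the layer sizes are illustrative.

```python
def scaled_conv_cost(kernel: int, in_ch: int, out_ch: int, feature_size: int,
                     alpha: float = 1.0, rho: float = 1.0):
    """Weights and multiply-accumulates of one k x k convolution under both multipliers."""
    m = int(alpha * in_ch)         # thinned input channels
    n = int(alpha * out_ch)        # thinned output channels
    f = int(rho * feature_size)    # reduced feature map side
    params = kernel * kernel * m * n
    macs = params * f * f          # one filter application per output position
    return params, macs


# Illustrative layer (assumed): 3x3 kernel, 64 -> 128 channels, 32x32 feature map.
base_params, base_macs = scaled_conv_cost(3, 64, 128, 32)
thin_params, thin_macs = scaled_conv_cost(3, 64, 128, 32, alpha=0.5)
small_params, small_macs = scaled_conv_cost(3, 64, 128, 32, rho=0.5)
```

With alpha = 0.5, both the weight count and the computation of this layer fall to a quarter, which is the quadratic effect described above. In this simplified model rho = 0.5 also cuts the computation to a quarter but leaves the weights of the convolution itself unchanged.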

Shallow SqueezeNext Results
The Shallow SqueezeNext architecture was implemented with the approaches mentioned in the literature review section, leading to a number of models of the proposed architecture. The model size ranges from 4.2MB down to 115KB (0.115MB), as shown in Table 3, with most experimental models achieving a model accuracy above 80% and a model speed under approximately 15 seconds per epoch. Of the 600 models (experiments) in total, only a few of the better results are discussed in the following tables. In these tables, each proposed model name consists of the Shallow SqueezeNext architecture version name followed by the resolution multiplier and the width multiplier. From Table 1 we can infer that the Shallow SqueezeNext-06-0.4x model reduces the model size to 272KB (0.272MB) from the 9.525MB of the baseline SqueezeNext. The Shallow SqueezeNext-06-0.4x model is 35x smaller than SqueezeNext-23-2x, 10x smaller than SqueezeNext-23-1x, and approximately 11x smaller than SqueezeNet v1.0 and SqueezeNet v1.1. The use of in-place activation functions, the elimination of the extra max pooling layers, and the introduction of suitable resolution and width multipliers made the proposed architecture more compact, efficient, and flexible. By changing the resolution and width multipliers, the proposed Shallow SqueezeNext architecture can be deployed with better accuracy, at the cost of model size and model speed. The hyperparameters of each Shallow SqueezeNext model variation were saved with the PyTorch function save(). The checkpoint is then loaded with the PyTorch function load() and subsequently used for training the architecture.
This step of saving and loading the checkpoint is critical to the success of Shallow SqueezeNext because only the important hyperparameters, rather than all of them, are saved and loaded. The generated checkpoint file is used to determine the model size (from its file size) and the final average accuracy. The checkpoint file is used again for testing the Shallow SqueezeNext deployment on Bluebox2.0 by NXP [11,16]. A benefit of the proposed architecture is that, with the help of the dropout layer [6], it can be readily implemented on real-time systems with limited memory, such as Bluebox2.0 by NXP [16,19]. Table 4 shows the results attained with different dropout probabilities for Shallow SqueezeNext, justifying a dropout probability of p = 0.3 or 0.4 as a good default for the proposed architecture. Table 5 presents additional results for Shallow SqueezeNext [9]. In Table 6, the results illustrate the effect of different optimizers [7] and of the ELU [13] implementation on the proposed architecture.
The above-mentioned tables also suggest that the deep residual layer [4,8,13,15] affects the tradeoff between model accuracy, model speed, and model size of the proposed Shallow SqueezeNext architecture. Figure 9 (a-c) shows the accuracies of the baseline SqueezeNet, the baseline SqueezeNext, and the proposed Shallow SqueezeNext trained on the CIFAR-10 dataset. Comparing Figures 9 (a), (b), and (c), the gap between the training and validation curves is smaller in Figure 9 (c) than in (a) and (b), which indicates less overfitting. These curves approach 1.0 quickly. This validates that the proposed architecture performs better in terms of model performance (model accuracy, model speed, and model size) than the SqueezeNext and SqueezeNet baseline models trained and tested from scratch on the CIFAR-10 and CIFAR-100 datasets.
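The checkpoint-based size measurement described earlier in this section can be sketched as follows. To stay self-contained, the sketch uses Python's pickle module as a stand-in for the PyTorch save() and load() functions; the toy state with 68,000 float32 weights, the file name, and the helper names are assumptions chosen so that the raw data is about 0.272MB, and they do not describe the actual model.

```python
import os
import pickle
import tempfile
from array import array


def save_checkpoint(state, path):
    """Serialize the selected state to disk (stand-in for the PyTorch save() call)."""
    with open(path, "wb") as f:
        pickle.dump(state, f)


def load_checkpoint(path):
    """Restore the state from disk (stand-in for the PyTorch load() call)."""
    with open(path, "rb") as f:
        return pickle.load(f)


def file_size_mb(path):
    """Model size as reported here: checkpoint file size in MB (1 MB = 10**6 bytes)."""
    return os.path.getsize(path) / 1e6


# Toy state (assumption): 68,000 float32 weights plus one scalar entry.
state = {"weights": array("f", [0.5] * 68_000), "epoch": 10}

with tempfile.TemporaryDirectory() as tmp:
    ckpt_path = os.path.join(tmp, "model.ckpt")
    save_checkpoint(state, ckpt_path)
    size_mb = file_size_mb(ckpt_path)
    restored = load_checkpoint(ckpt_path)

# Size ratio between the baseline (9.525MB) and the smallest model (0.272MB) reported above.
reduction_vs_baseline = 9.525 / 0.272
```

Because the file holds only what is placed in the state dictionary, saving just the entries that are needed keeps the reported model size close to the raw size of the weights.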

Bluebox2.0 Implementation Results
Finally, the Shallow SqueezeNext architecture is deployed on Bluebox2.0 by NXP to verify and validate its efficiency and integrity [16]. The models were trained on the CIFAR-10 and CIFAR-100 datasets with an RTX 2080 Ti GPU, and the PyTorch-generated checkpoint files were then deployed and tested on Bluebox2.0 by NXP. The deployment of Shallow SqueezeNext is shown in Figure 4, and the result comparison is shown in Table 7. Figure 10 illustrates the deployment results attained by training the architecture from scratch on the CIFAR-10 dataset with the RTX 2080 Ti GPU and testing it by deploying it with the help of RTMaps on the real-time development platform Bluebox2.0 by NXP.

Conclusion
In this paper, the proposed Shallow SqueezeNext architecture is introduced. It is based on insights from existing CNNs/DNNs and on methods such as fine hyperparameter tuning (that is, using different optimizers with step-size-decay learning rate scheduling, using momentum and Nesterov with the SGD optimizer, and tuning the parameters for normalization and data preprocessing), training the proposed architecture from scratch with no transfer learning, using comparatively small datasets, and architecture modifications.