Real-Time Distracted Drivers Detection Using Deep Learning

In the last few years, the number of road accidents has been increasing worldwide. According to the World Health Organization, the most common cause of these accidents is driver distraction, which in many cases involves the use of a mobile phone. This paper presents an attempt to develop a system that detects distracted drivers and warns the responsible person. The system is CNN based and detects and identifies the cause of distraction. The base architecture of the CNN is VGG-16, modified for this task. Various activation functions (Leaky ReLU, DReLU, SELU) were used in order to investigate their effect on performance. The performance of a lightweight attention module (squeeze-and-excitation) was also evaluated. Experimental results show that the system outperforms earlier lightweight models in the literature, achieving an accuracy of 95.82%.


Introduction
According to the World Health Organization (WHO) report [1], 1.35 million people worldwide die in traffic accidents each year. That is nearly 3,700 people dying on the world's roads every day. "One of the most heart-breaking statistics in this report is that road traffic injury is the leading cause of death for people aged between 5 and 29 years" [1]. The report also shows that the total number of deaths increases from year to year and that the most common cause behind these accidents is driver distraction. The use of a mobile phone while driving is widespread among young and novice drivers, adding further to the already high risk of crash and death among these groups. Telephone use while driving increases the likelihood of being involved in a crash by a factor of four, while texting increases crash risk by a factor of 23. Drivers' reaction times have also been shown to be 50% slower with telephone use than without. The National Highway Traffic Safety Administration of the United States (NHTSA) reported for 2016 the death of 3,450 people and 391,000 people injured in car accidents caused by distracted drivers [29]. The same report states that 481,000 passenger vehicles are driven by people using handheld cell phones during the day. In the United States, around 10 people are killed and more than 1,000 are injured every day in road crashes that are reported to involve a distracted driver. The situation in Romania is no better. The National Institute of Statistics (INS) reported 1,951 deaths and 40,211 people injured in road accidents involving distracted drivers [15]. Relative to population size, the number of fatalities reported to involve distracted drivers is 9 times higher in Romania than in the USA. According to NHTSA, distracted driving can be defined as "any activity that diverts attention of the driver from the task of driving" and can be classified into manual, visual and cognitive distractions [5,29].
Examples of cognitive distraction are daydreaming and being lost in thought. Manual distractions include talking or texting on a mobile phone, eating, drinking, talking to passengers in the vehicle, etc., and an example of visual distraction is sleepiness.
Nowadays, an increasing number of modern vehicles have Advanced Driver Assistance Systems (ADAS) such as stability control, traction control, lane departure warning, adaptive cruise control and anti-lock brakes. These systems are designed to prevent accidents by warning the driver of possible problems and by keeping the driver and passengers safe in the event of an accident. But even the latest autonomous vehicles are not fully autonomous and require the driver to be attentive and ready to take control of the steering wheel in an emergency. There are five levels of automated driving, of which the top two are considered autonomous. Most self-driving cars fall into level 2 or 3, which means that a human driver must be ready to intervene when requested and must not be distracted. An example of a system under development that falls into the level 4 category is the Waymo self-driving cab service. There have been a few self-driving car fatalities, such as the crash of a Tesla on autopilot into a white truck-trailer in Williston, Florida in May 2016, and the incident in which Uber's self-driving car, with an emergency driver behind the wheel, hit and killed a pedestrian in Arizona in March 2018. In both of these fatalities, the driver could have avoided the accident, but the evidence shows that he was clearly distracted. This makes distracted driver detection an essential capability of the car and can lead to the development of a new ADAS system. Detecting driver inattention is extremely important for additional prevention measures. If the vehicle could detect such distractions and then warn the driver, send warning messages to headquarters if the driver is a professional driver, or notify the insurance company, then the number of road accidents could be reduced, bad habits of professional drivers could be detected, and a more personalized insurance policy could be created for the vehicle.
The focus of this paper is detecting driver distraction. A Convolutional Neural Network approach is presented for this problem, with different hyperparameters and activation functions. An attempt is also made to introduce an attention module without adding computational complexity or memory overhead, while maintaining good accuracy.

Related Work
This section reviews the relevant and significant works in the literature on detecting distracted drivers. According to NHTSA, the main cause of distraction is the use of mobile phones [29]. Motivated by this, some researchers have tried to detect the use of a mobile phone while driving. In 2011, Zhang et al. created a dataset using a camera and used a Hidden Conditional Random Fields model based on face, mouth and hand features to detect mobile phone use [43]. In 2015, Nikhil et al. created a dataset for hand detection in the vehicle environment, used the Aggregate Channel Features (ACF) object detector, and achieved an average precision of 70.09% [8]. Seshadri et al. [35] created their own dataset to detect mobile phone usage, using the Histogram of Oriented Gradients (HoG) [7] method and an AdaBoost [9] classifier, and obtained a classification accuracy of 93.9%. Le et al., using the same dataset, achieved a higher accuracy of 94.2% with the Faster R-CNN [33] deep learning model. However, the system is slow, and their approach is based on face and hand segmentation to detect mobile phone use and locate the hands on the steering wheel [21].
A significant contribution in this area has been made by the Laboratory of Intelligent and Safe Automobiles at the University of California San Diego, but it has dealt with only three types of distraction: radio tuning, mirror adjustment and operating the gear shift. Martin et al. presented a vision-based analysis framework that recognizes activities in the vehicle using two Kinect cameras [27]. Ohn-Bar et al. proposed a fusion of classifiers in which the image is segmented into three regions: steering wheel, gearbox and dashboard, to infer the actual activity [30]. They also presented a region-based classification approach to detect the presence of hands in certain predefined regions of an image [31] and expanded their research to include eye cues [32]. However, they only considered three types of distraction.
Zhao et al. created a more inclusive driving dataset captured from the driver's side, considering four activities: driving safely, operating the shift lever, eating and making phone calls [44]. The authors achieved an accuracy of 90.5% using random forests and the contourlet transform. They also proposed a system that uses PHOG features and multilayer perceptrons, achieving an accuracy of 94.75% [45]. In 2016, Yan et al. presented a solution based on convolutional neural networks that achieved a 99.78% classification accuracy [41]. Other CNN solutions, using different datasets, are presented in [18,28,36].
The earlier datasets focused only on a limited set of distractions, and many of them are not publicly available. StateFarm's distracted driver detection competition on Kaggle defined ten postures to be detected [38]. This was the first dataset to consider a wide variety of distractions and to be available to the public. Many approaches proposed by researchers were based on traditional hand-crafted feature extractors such as SIFT [26], SURF [4] and HoG [7], combined with classical classifiers such as SVM [11], BoW and neural networks. However, CNNs have proven to be the most effective technique for obtaining high accuracy [12]. But, according to the rules and regulations, the use of the dataset is limited to the purpose of the contest. In 2017, Abouelnaga et al. created a new dataset, similar to StateFarm's, for detecting driver distraction [2]. The authors proposed a solution using a weighted ensemble of five different Convolutional Neural Networks. The system achieved good classification accuracy, but it is too complex for real-time detection. Baheti et al. addressed this problem of complexity, reduced the number of parameters significantly, and achieved an accuracy of 95.54% [3].

Experiment Settings
The classification performance is evaluated on two convolutional networks that share the same base architecture but differ in their activation functions. Given the large number of possible architectures, a well-known deep neural network was used as the base architecture, and the same hyperparameters were kept across the different activation settings and layer orders.

Dataset Description
In this paper, the dataset created by Abouelnaga et al. [2] was used. The dataset includes ten classes: safe driving, sending messages on a mobile phone with the right or left hand, talking on a mobile phone with the right or left hand, adjusting the radio, eating or drinking, doing hair or makeup, reaching behind, and talking to a passenger. Example images for each class of the dataset are shown in figure 1. The data was collected in different driving conditions from thirty-one participants from seven different countries. The dataset comprises 17,308 images divided into a training set (12,977 images) and a test set (4,331 images).

Original VGG-16 Architecture
Deep convolutional neural networks [19,22] have led to a series of breakthroughs in image segmentation [34], image classification [37], natural language processing and many other areas. Since 2012, there has been rapid progress in CNN research and applications because of the availability of large amounts of data and computing power. Various architectures such as AlexNet [20], VGGNet [37], ResNet [10] and U-Net [34] became well known. In this paper, the modified VGG-16 architecture proposed by Baheti et al. [3] was explored and further modified in order to try to improve the accuracy.
VGG Net is one of the best-known CNN architectures in the literature. It is well known because it is simple and deep, and it has worked well on image classification and localization tasks. It is also used as a backbone or as a part of other architectures such as TernausNet [14] or S3FD [42]. The VGG-16 architecture is shown in figure 2. VGG uses 3×3 filters in all convolutional layers, 2×2 max pooling with stride 2, the ReLU activation function and the categorical cross-entropy loss. The initial layers of the CNN act as a feature extractor, and the fully connected layers act as a classifier that assigns the input images to the predefined classes.
The original model has 1000 outputs, corresponding to the 1000 object classes of ImageNet. In order to adapt the model to our dataset, the number of outputs needs to be changed to 10.
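As a sketch, this adaptation amounts to replacing the 1000-way softmax head with a 10-way one. The NumPy forward pass below illustrates the change; the 512-dimensional feature vector and the random weights are hypothetical stand-ins for the real backbone output and trained parameters:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
features = rng.standard_normal((1, 512))  # stand-in for backbone features

# ImageNet head maps features to 1000 classes; the adapted head maps to 10.
w_imagenet = rng.standard_normal((512, 1000)) * 0.01
w_adapted = rng.standard_normal((512, 10)) * 0.01  # freshly initialized

imagenet_probs = softmax(features @ w_imagenet)
adapted_probs = softmax(features @ w_adapted)
```

Only the final classifier matrix changes shape; the feature-extraction layers and their pre-trained weights are kept as they are.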

Activation Functions
In this section, three kinds of rectified linear units are introduced: the leaky rectified linear unit (Leaky ReLU), the displaced rectified linear unit (DReLU) and the scaled exponential linear unit (SELU). In the following, x denotes the input and y the corresponding output of the activation function. Each unit is introduced in its own subsection.

Leaky Rectified Linear Unit
Leaky Rectified Linear Unit was first introduced by Maas et al. [24]. Formally:

y = x,   if x ≥ 0
y = x/a, if x < 0

where a is a fixed parameter in the range (1, +∞). In the original paper, it is suggested to set a to a large number.
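A minimal NumPy sketch of this formulation (a = 100 is used here only as an illustrative large value):

```python
import numpy as np

def leaky_relu(x, a=100.0):
    """Leaky ReLU: identity for x >= 0, x/a for x < 0, with a in (1, +inf)."""
    x = np.asarray(x, dtype=float)
    return np.where(x >= 0, x, x / a)
```

For example, `leaky_relu(np.array([-100.0, 2.0]))` returns `[-1.0, 2.0]`: negative inputs are scaled down by a rather than zeroed out, so a small gradient still flows through them.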

Displaced Rectified Linear Units
Displaced Rectified Linear Unit [25] is a generalization of both ReLU and SReLU [6], obtained by allowing the inflection point to move diagonally from the origin to any point of the form (−δ, −δ). The following equation defines DReLU:

y = x,  if x ≥ −δ
y = −δ, if x < −δ

If δ = 0, DReLU becomes ReLU. If δ = 1, DReLU becomes SReLU. Based on experimental results, the authors suggest setting δ = 0.05.
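DReLU is not a built-in activation in common frameworks, but under the definition above it reduces to clipping the input from below at −δ. A minimal NumPy sketch:

```python
import numpy as np

def drelu(x, delta=0.05):
    """Displaced ReLU: identity for x >= -delta, constant -delta below it."""
    x = np.asarray(x, dtype=float)
    return np.maximum(x, -delta)
```

Setting `delta=0.0` recovers the ordinary ReLU, matching the special case noted above.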

Scaled Exponential Linear Units
Scaled Exponential Linear Units [17] are a kind of ELU [6] with two parameters. Mathematically:

y = λx,           if x > 0
y = λα(e^x − 1),  if x ≤ 0

λ and α are two fixed parameters, meaning the model does not backpropagate through them and they are not hyperparameters to be tuned. Their values are derived as described in the paper; for standard inputs with mean 0 and standard deviation 1, the values are α = 1.6732 and λ = 1.0507.
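With the fixed constants above, SELU can be sketched in NumPy as:

```python
import numpy as np

ALPHA = 1.6732   # alpha for inputs with mean 0, std 1
LAMBDA = 1.0507  # lambda (scale) for inputs with mean 0, std 1

def selu(x, alpha=ALPHA, lam=LAMBDA):
    """SELU: lam*x for x > 0, lam*alpha*(exp(x) - 1) for x <= 0."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, lam * x, lam * alpha * np.expm1(x))
```

`np.expm1` computes e^x − 1 accurately for small x; the negative branch saturates at −λα ≈ −1.758, which is what gives SELU its self-normalizing behaviour.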

Modified VGG-16 Architecture
Baheti et al. modified the original VGG-16 architecture in order to reduce the total number of parameters, and used various regularization techniques in order to reduce the generalization error [3]. They replaced the fully connected layers with convolutional layers, because dense layers are computationally expensive and consume most of the network's parameters. In order to reduce overfitting, they used batch normalization [16] and L2 weight regularization. They also used dropout between the groups of layers, reducing overfitting by randomly dropping out some neurons during the training phase.
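To see why the dense layers dominate the parameter count, a quick back-of-the-envelope computation helps (standard VGG-16 shapes; the 512-to-512 convolution is an illustrative comparison, not the exact replacement layer used in [3]):

```python
# VGG-16's first dense layer: flattened 7x7x512 features -> 4096 units
fc1_params = 7 * 7 * 512 * 4096 + 4096   # weights + biases

# For comparison: one 3x3 convolution mapping 512 -> 512 channels
conv_params = 3 * 3 * 512 * 512 + 512    # weights + biases

print(fc1_params, conv_params)  # 102764544 2359808
```

A single dense layer thus holds over 100 million parameters, roughly 40 times more than a full 3×3 convolutional layer, which is why swapping the classifier head for convolutions shrinks the model so dramatically.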
The first modification is removing dropout. In the batch normalization paper, it is said that "the resulting networks can be trained with saturating nonlinearities, are more tolerant to increased training rates, and often do not require Dropout for regularization" [16]. Since convolutional layers have few parameters, they need less regularization to begin with. Furthermore, because of the spatial relationships encoded in the feature maps, activations can become highly correlated and this makes dropout ineffective.
The second modification is the addition of an attention mechanism. Attention can be interpreted as a way of biasing the allocation of computational resources towards the most informative components of a signal [39]. Attention mechanisms have demonstrated their utility across many tasks, including image segmentation [23], image captioning [40] and others. To preserve the small number of parameters, a search was conducted for a lightweight, computationally efficient gating mechanism: the Squeeze-and-Excitation block [13].
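As an illustration, the forward pass of a squeeze-and-excitation block can be sketched in NumPy; the random weights stand in for the learned fully connected parameters, and r is the reduction ratio from the SE paper:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def se_block(feature_map, w1, w2):
    """Squeeze-and-Excitation forward pass on an (H, W, C) feature map."""
    # Squeeze: global average pooling over the spatial dimensions -> (C,)
    z = feature_map.mean(axis=(0, 1))
    # Excitation: bottleneck FC (C -> C/r), ReLU, FC (C/r -> C), sigmoid
    s = sigmoid(np.maximum(z @ w1, 0.0) @ w2)
    # Scale: reweight each channel of the input by its excitation value
    return feature_map * s

rng = np.random.default_rng(0)
H, W, C, r = 4, 4, 8, 2
fmap = rng.standard_normal((H, W, C))
w1 = rng.standard_normal((C, C // r)) * 0.1  # hypothetical learned weights
w2 = rng.standard_normal((C // r, C)) * 0.1
out = se_block(fmap, w1, w2)
```

Because the two fully connected layers act only on the C-dimensional pooled vector, the block adds roughly 2C²/r parameters per insertion, which keeps the overall model lightweight.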
The modified network architecture is shown in figure 3; in some of the experiments, a squeeze-and-excitation module was added after each group of convolutional layers. The L2 regularization was also removed.

Results and Discussion
In this paper, a system for distracted driver detection based on convolutional neural networks was designed. The weights were initialized from an ImageNet pre-trained model; for the new layers, a random normal distribution was used. Training and testing were carried out on two NVIDIA 1080 Ti GPUs with 11 GB of memory each. The batch size was 32 per GPU and the number of epochs was 32. Training used Stochastic Gradient Descent with a learning rate of 10^-4, a decay rate of 10^-6 and a momentum of 0.9. The framework used for model training and evaluation is TensorFlow.
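For reference, the update rule corresponding to these settings can be sketched as follows. The time-based schedule lr/(1 + decay·step) matches the legacy Keras SGD `decay` argument, which is an assumption about how the original training was configured:

```python
def sgd_momentum_step(w, grad, v, step, lr0=1e-4, decay=1e-6, momentum=0.9):
    """One SGD step with momentum and time-based learning-rate decay."""
    lr = lr0 / (1.0 + decay * step)  # decayed learning rate for this step
    v = momentum * v - lr * grad     # update the velocity
    return w + v, v                  # apply the velocity to the weight

# Toy usage: minimize f(w) = w^2 (gradient 2w) starting from w = 1.0
w, v = 1.0, 0.0
for step in range(1000):
    w, v = sgd_momentum_step(w, 2.0 * w, v, step)
```

With these small learning-rate and decay values the updates are gentle; the momentum term accumulates gradient direction over steps, which helps the fine-tuning converge smoothly from the ImageNet initialization.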

Conclusions
Driver distraction is a problem leading to a large number of road crashes and many deaths worldwide. Distracted driver detection is therefore becoming an essential component of ADAS systems. In this paper, a convolutional neural network able to detect distracted drivers, as well as the cause of distraction, was presented. A variation of the VGG-16 architecture proposed for this task was modified, and several activation functions and an attention mechanism were applied in order to try to increase the accuracy. With an accuracy of 95.82%, the proposed system outperforms the thinned version from a previous paper [3].
Incorporating temporal context may help reduce the classification error. In the future, introducing more features such as eye orientation and head orientation, correlated with signals received from the car, may help in the detection of cognitive distractions and of visual distractions such as sleepiness.