Muhammad Zakaria Eskief 1*; Nizar Al Issa 1; Nareman Shouki 1; Mao Chenghui 2
1, Department of Economics and International Business Administration, Higher Institute for Administrative Development, Damascus University, Damascus 22743, Syria
2, School of Traffic & Transportation Engineering, Central South University, Changsha 410083, Hunan, China
E-mail: zakariaeskif4@gmail.com
Received: 23/04/2025
Acceptance: 07/06/2025
Available Online: 08/06/2025
Published: 01/07/2025

Manuscript link
http://dx.doi.org/10.30493/DAS.2025.518818
Abstract
Facilitating prompt responses to roadside incidents is essential for improving traffic safety and preserving lives. This work introduces an innovative method utilizing the lightweight and efficient MobileNetV2 architecture for the classification of CCTV accident images. For that purpose, an image library comprising 995 static CCTV footage images (469 accident images and 526 non-accident images) was used. This dataset was split into training (80%) and testing (20%) subsets. The trained model attained training and validation accuracies of 93.5% and 89%, respectively, underscoring the possibility of employing analogous deep learning models in traffic surveillance systems to enhance roadside safety and emergency response efficacy. The presented system is engineered to autonomously perform binary classification of accident and non-accident images with high precision and low computational cost, rendering it suitable for implementation in resource-limited settings. Nevertheless, additional hyperparameter optimization, along with dataset augmentation, is essential for enhanced performance.
Keywords: Accident detection, MobileNetV2, CCTV, Deep learning
Introduction
Roadside accidents constitute a major public safety issue, resulting in serious injuries and fatalities globally [1][2]. Timely discovery and swift rescue efforts are essential to alleviate the consequences of such occurrences, potentially saving numerous lives and diminishing the severity of injuries [3-6]. Conventional accident detection approaches frequently depend on manual reporting, which may be tardy and unreliable [6]. In the era of smart cities, utilizing technology for swift accident detection is becoming increasingly essential [7][8]. The integration of real-time surveillance with automated detection systems has the potential to transform emergency response by ensuring the fast identification and resolution of incidents [9][10]. This integration enhances public safety, traffic management, and overall urban mobility.
CCTV cameras are prevalent in contemporary metropolitan settings, offering comprehensive surveillance of roadways and junctions. Leveraging these existing infrastructures for accident detection is both economical and efficient. By employing contemporary artificial intelligence methods, it is feasible to transform these passive monitoring instruments into active surveillance devices that perpetually examine video feeds for indications of accidents [11]. This method allows for the instantaneous identification of occurrences, guaranteeing that emergency services are promptly notified, therefore expediting rescue operations and minimizing reaction times [12][13].
Notwithstanding the presence of CCTV cameras, conventional accident detection methodologies encounter numerous constraints. Manual surveillance of video feeds is arduous and susceptible to human error, resulting in delayed reactions [14][15]. Moreover, traditional algorithms frequently struggle to detect accidents precisely owing to the intricacy and variety of real-world situations. Environmental factors, such as inadequate lighting, adverse weather conditions, and obstructions, further intensify these challenges. As a result, numerous incidents remain unrecognized or are reported belatedly, undermining the efficacy of rescue operations and heightening the likelihood of adverse consequences [16][17].
Deep learning provides a revolutionary solution to the shortcomings of conventional accident detection systems. Convolutional neural networks (CNNs) trained on extensive datasets of accident scenarios can discern subtle patterns and abnormalities that signify an accident. Numerous studies have demonstrated the efficacy of deep learning in improving roadside accident detection and prompt rescue efforts [18-20]. Prior research has concentrated on several neural network architectures and training techniques to enhance detection precision and minimize false positives. These initiatives have demonstrated encouraging outcomes, suggesting that deep learning could offer a reliable solution for real-time accident surveillance. Nonetheless, additional research and development are required to ready these technologies for extensive implementation and integration with current urban infrastructure [21]. In that context, MobileNetV2, an efficient and lightweight convolutional neural network architecture, is ideally suited for this application, facilitating real-time processing of video streams without requiring substantial computational resources [22]. This automation markedly improves the reliability and speed of accident detection, guaranteeing timely notification of emergency services.
Therefore, the current work was conducted to propose a framework employing the MobileNetV2 architecture and to assess its efficiency in the binary classification of static CCTV images into accident and non-accident classes.
Material and Methods
The dataset
The dataset “Accident Detection from CCTV Footage” [23], accessible on Kaggle, comprises images utilized for training and testing models aimed at identifying accidents from CCTV footage. The dataset is divided into two categories: Accident (469 images) and Non-Accident (526 images). The images are appropriately labeled, and the dataset is partitioned into training and testing subsets (Table 1).
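The dataset proportions described above can be sketched as follows; this is an illustrative calculation only, since the exact per-subset counts are those reported in Table 1, and the rounding of the 80/20 split is an assumption.

```python
# Hypothetical sketch of the dataset proportions described above; the exact
# per-subset counts are those reported in Table 1, not computed here.
counts = {"Accident": 469, "Non Accident": 526}

total = sum(counts.values())     # 995 images in the full library
n_train = round(total * 0.8)     # 80% training split
n_test = total - n_train         # 20% testing split

print(total, n_train, n_test)    # 995 796 199
```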

The model
The assessed model is a lightweight MobileNetV2 framework comprising three essential components: the Separable Convolution (SC) block (Fig. 1 A), the Inverted Separable Convolution (ISC) block (Fig. 1 B), and the Inverted Residual (IR) block (Fig. 1 C) [22].

The ISC block initially augments the number of channels in the input feature map with a 1×1 convolution, accompanied by batch normalization and ReLU6 activation (1×1 R-Conv). It then generates feature maps through Depthwise (DW) convolution and subsequently employs a linear 1×1 convolution (1×1 L-Conv) to diminish the channel number. The IR block is derived from the ISC block and alleviates the gradient vanishing problem; to preserve identical feature map dimensions pre- and post-processing, the stride of the depthwise convolution is set to 1. The residual feature maps are derived by summing the feature maps from the 1×1 linear convolution with the original input feature maps.
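The IR block described above can be sketched in Keras as follows. The expansion factor of 6 and the 3×3 depthwise kernel are standard MobileNetV2 defaults assumed here for illustration; the exact values used in the present model are not stated in the text.

```python
import tensorflow as tf
from tensorflow.keras import layers

def inverted_residual(x, expansion=6, stride=1):
    """Sketch of the Inverted Residual (IR) block. The expansion factor (6)
    and 3x3 depthwise kernel are assumed MobileNetV2 defaults."""
    in_ch = x.shape[-1]
    h = layers.Conv2D(in_ch * expansion, 1, use_bias=False)(x)  # 1x1 R-Conv: expand channels
    h = layers.BatchNormalization()(h)
    h = layers.ReLU(max_value=6.0)(h)                           # ReLU6 activation
    h = layers.DepthwiseConv2D(3, strides=stride, padding="same", use_bias=False)(h)
    h = layers.BatchNormalization()(h)
    h = layers.ReLU(max_value=6.0)(h)
    h = layers.Conv2D(in_ch, 1, use_bias=False)(h)              # 1x1 L-Conv: linear projection
    h = layers.BatchNormalization()(h)
    if stride == 1:                       # residual sum only when dimensions are preserved
        h = layers.Add()([x, h])
    return h

inp = layers.Input((64, 64, 32))
out = inverted_residual(inp)
model = tf.keras.Model(inp, out)
print(model.output_shape)  # (None, 64, 64, 32)
```

With stride 1 the output retains the input's spatial size and channel count, which is what makes the residual addition possible.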
The architecture of the present MobileNetV2 backbone, constructed with ISC and IR blocks, is shown in Figure 2. The MobileNetV2 model employs an input dimension of 512×512×3 and generates seven sets of feature maps, which, in the original architecture, are subsequently passed to a Mobile FPN model for feature fusion and motion detection [22]. The spatial size of these feature maps varies from 64×64×32 to 1×1×256. These feature maps correspond to intermediate layers of the network, with earlier layers producing larger spatial resolutions (e.g., 64×64×32) and later layers yielding smaller spatial sizes with higher channel depths (e.g., 1×1×256).
Since the aim of the current model was simply to classify the static images into accident (class 1) and non-accident (class 0) images, a classification head was added to the original architecture. This classification head consists of a Global Average Pooling (GAP) layer which condenses the spatial dimensions of the final feature map (1×1×256) into a singular vector of size 256, a Fully Connected Layer that transforms the 256-dimensional vector into a 2-dimensional vector (one neuron per class), and a Softmax Activation function that produces probabilities for the two classes (Accident and Non-Accident) (Fig. 2).
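The classification head described above can be sketched in Keras as follows; the Input layer here merely stands in for the backbone's final 1×1×256 feature map, as a minimal illustration rather than the authors' exact implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Sketch of the classification head; the Input layer stands in for the
# backbone output (final 1x1x256 feature map).
feat = layers.Input((1, 1, 256))
x = layers.GlobalAveragePooling2D()(feat)  # condense 1x1x256 into a 256-d vector
logits = layers.Dense(2)(x)                # fully connected layer: one neuron per class
probs = layers.Softmax()(logits)           # probabilities for Accident / Non-Accident
head = tf.keras.Model(feat, probs)
print(head.output_shape)  # (None, 2)
```

Because the final feature map is already 1×1 spatially, the GAP layer here effectively flattens it; the same head would also work unchanged on larger final feature maps.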

Training and evaluation
The model was implemented and trained utilizing the TensorFlow framework, employing the ADAM optimizer with a learning rate of 0.001 and a batch size of 32 images. All images in the dataset were resized to 512×512. Moreover, several augmentation techniques were employed, such as brightness adjustment (±20%), cropping (up to 5% with black or white replacement), and horizontal flipping, to improve the model’s generalization abilities and mitigate overfitting.
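A minimal sketch of such a preprocessing and augmentation pipeline in Keras is given below. The choice of layers is an assumption, as the paper does not list its implementation; in particular, RandomZoom only approximates the stated "up to 5% cropping" and does not reproduce the black/white border replacement exactly.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Assumed sketch of the preprocessing/augmentation pipeline; layer choices
# approximate the augmentations listed in the text.
augment = tf.keras.Sequential([
    layers.Resizing(512, 512),                      # resize every image to 512x512
    layers.RandomBrightness(0.2),                   # brightness adjustment (+/-20%)
    layers.RandomZoom(height_factor=(-0.05, 0.0)),  # zoom-in approximating up to 5% crop
    layers.RandomFlip("horizontal"),                # horizontal flipping
])

optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)  # ADAM, lr = 0.001
batch_size = 32                                            # as stated in the text

sample = augment(tf.zeros((2, 600, 800, 3)), training=True)
print(sample.shape)  # (2, 512, 512, 3)
```

Note that the random layers are only active when called with `training=True`; at inference time only the resizing is applied.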
In order to assess the efficacy of the developed model, standard classification evaluation metrics, including Accuracy, Precision, Recall, and F1 Score, were utilized with the validation set:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)

Where: TP (True positive), TN (True negative), FP (False positive), and FN (False negative).
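These metrics can be computed directly from confusion-matrix counts, as sketched below. The counts passed in the usage line are illustrative placeholders only, not the study's actual confusion matrix.

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard metrics from confusion-matrix counts (TP, TN, FP, FN)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Illustrative counts only, not the study's actual confusion matrix
acc, p, r, f1 = classification_metrics(tp=88, tn=90, fp=10, fn=12)
print(round(acc, 3), round(p, 3), round(r, 3), round(f1, 3))  # 0.89 0.898 0.88 0.889
```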
Results and Discussion
The training loss decreases sharply from the initial epochs (Fig. 3 A), reaching near-zero values by the 10th epoch and stabilizing thereafter. This rapid decline indicates that the model quickly learns the underlying patterns in the training data. Concurrently, the training accuracy increases steeply, approaching 1.0, and remains near-perfect throughout the remaining epochs. The high training accuracy coupled with the low training loss suggests that the model effectively captures the complexities of the training dataset.
Similarly, the validation loss exhibited a downward trend, albeit with some fluctuations, particularly within the first 15 epochs (Fig. 3 B). These fluctuations can be attributed to the model’s adjustments as it generalizes to new, unseen data. Despite these variations, the validation loss stabilized at approximately 0.2, indicating appropriate generalization. The validation accuracy closely mirrors the training accuracy, reaching approximately 0.89 early in the training process and maintaining a consistent 0.92 level in later training stages, which demonstrates the model’s robustness and reliability on unseen data.

The analysis of confusion matrices indicated that the trained model achieved an accuracy of 92% in classifying non-accident images and 95% in identifying accident images (Fig. 4 A), resulting in an overall training accuracy of 93.5%. In the validation dataset (Fig. 4 B), the corresponding values were 90% for non-accident images and 88% for accident images. The findings indicate that the model demonstrates strong generalization capabilities when applied to previously unobserved data, exhibiting only slight reductions in performance between the training and validation stages.

The model demonstrates an overall validation accuracy of 89%, accompanied by average precision, recall, and F1 scores of 89.8%, 88% and 88.9%, respectively. The presented metrics demonstrate a well-rounded performance across the two classes, suggesting that the model effectively balances sensitivity in identifying accidents with specificity in recognizing non-accident scenarios.
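As a quick arithmetic check, the reported F1 score is consistent with the stated average precision and recall, since F1 is their harmonic mean:

```python
# Consistency check on the reported averages: precision 89.8% and recall 88%
# should reproduce the stated F1 of 88.9%.
precision, recall = 0.898, 0.88
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # 0.889
```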
The findings illustrate the efficacy of incorporating ISC and IR components within the MobileNetV2 framework for the classification of traffic scenes. Through the utilization of depth-wise separable convolutions and linear bottlenecks, the model attains an advantageous balance between computational efficiency and classification accuracy. This holds significant relevance in real-time applications like intelligent transportation systems, where both speed and reliability are of utmost importance [24].
The analysis of the confusion matrices indicates a persistently low incidence of misclassifications, accompanied by marginally elevated false negative rates during the validation phase relative to the training phase. This indicates a moderate degree of class imbalance or variability within the validation set, potentially introducing further challenges in the generalization process. Nonetheless, the model demonstrates a high degree of robustness across the two datasets, as indicated by the minimal performance disparity observed between training and validation accuracy [25].
The integration of the ISC block seems to improve feature extraction by separating spatial and channel-wise transformations, thus enhancing discriminative ability without a substantial rise in model complexity. In a comparable manner, the IR blocks enhance learning efficiency by employing residual connections and expanding dimensionality, thereby augmenting the model’s capacity to discern nuanced distinctions between accident and non-accident scenarios.
When evaluated against analogous lightweight architectures documented in existing studies, the proposed model exhibits commendable performance regarding both accuracy and efficiency [26]. Subsequent research may investigate additional optimization strategies, including neural architecture search and knowledge distillation, to enhance the practicality of deployment on edge devices. Furthermore, the feature maps obtained from the proposed model can be seamlessly incorporated into motion detection systems [22], facilitating a swift reaction in the event of road accidents. The results obtained substantiate the appropriateness of the modified MobileNetV2 architecture for practical implementations in traffic monitoring scenarios.
Conclusions
The modified MobileNetV2 architecture presented in this research exhibits swift convergence, attaining elevated training accuracy and consistent validation performance (89–92%), which suggests strong generalization capabilities. The amalgamation of ISC and IR components in MobileNetV2 significantly improves feature extraction capabilities while preserving computational efficiency, thereby making it an appropriate choice for real-time traffic monitoring. Enhancing model precision by utilizing larger datasets and incorporating the derived feature maps within motion detection frameworks could establish the proposed model as a promising approach for autonomous CCTV systems, where rapid accident identification is crucial.
References
- Mohammed AA, Ambak K, Mosa AM, Syamsunur D. A Review of the Traffic Accidents and Related Practices Worldwide. Open Transp. J. 2019;13(1):65–83. DOI
- Anjuman T, Hasanat-E-Rabbi S, Siddiqui CK, Hoque MM. Road traffic accident: A leading cause of the global burden of public health injuries and fatalities. In Proc. Int. Conf. Mech. Eng. Dhaka Bangladesh 2020:29-31.
- Dorn M, Shepherd S, Satterly S, Dorn C. Staying alive: How to act fast and survive deadly encounters. Sourcebooks, Inc.; 2014.
- Dela Cruz OG, Padilla JA, Victoria AN. Managing Road Traffic Accidents: A Review on Its Contributing Factors. IOP Conference Series: Earth and Environmental Science. 2021;822(1):012015. DOI
- James SL, Lucchesi LR, Bisignano C, Castle CD, Dingels ZV, Fox JT, Hamilton EB, Liu Z, McCracken D, Nixon MR, Sylte DO. Morbidity and mortality from road injuries: results from the Global Burden of Disease Study 2017. Inj. Prev. 2020;26(Suppl 2):i46-56.
- Peelam MS, Naren, Gera M, Chamola V, Zeadally S. A Review on Emergency Vehicle Management for Intelligent Transportation Systems. IEEE Trans. Intell. Transp. Syst. 2024;25(11):15229–46. DOI
- Heidari A, Navimipour NJ, Unal M. Applications of ML/DL in the management of smart cities and societies based on new trends in information technologies: A systematic literature review. Sustain. Cities Soc. 2022;85:104089. DOI
- Mahrez Z, Sabir E, Badidi E, Saad W, Sadik M. Smart urban mobility: When mobility systems meet smart data. IEEE Trans. Intell. Transp. Syst. 2021;23(7):6222-39. DOI
- Damaševičius R, Bacanin N, Misra S. From sensors to safety: Internet of Emergency Services (IoES) for emergency response and disaster management. J. Sens. Actuator Netw. 2023;12(3):41. DOI
- Khan A, Gupta S, Gupta SK. Emerging UAV technology for disaster detection, mitigation, response, and preparedness. J. Field Robot. 2022;39(6):905-55. DOI
- Razi A, Chen X, Li H, Wang H, Russo B, Chen Y, Yu H. Deep learning serves traffic safety analysis: A forward‐looking review. IET Intell. Transp. Syst. 2023;17(1):22-71. DOI
- Abbasi M, Shahraki A, Taherkordi A. Deep learning for network traffic monitoring and analysis (NTMA): A survey. Comput. Commun. 2021;170:19-41. DOI
- Alkinani MH, Khan WZ, Arshad Q. Detecting human driver inattentive and aggressive driving behavior using deep learning: Recent advances, requirements and open challenges. IEEE Access. 2020;8:105008-30. DOI
- Paneru S, Jeelani I. Computer vision applications in construction: Current state, opportunities & challenges. Autom. Constr. 2021;132:103940. DOI
- Kotseruba I, Tsotsos JK. Attention for vision-based assistive and automated driving: A review of algorithms and datasets. IEEE Trans. Intell. Transp. Syst. 2022;23(11):19907-28. DOI
- Oster Jr CV, Strong JS, Zorn CK. Analyzing aviation safety: Problems, challenges, opportunities. Res. Transp. Econ. 2013;43(1):148-64. DOI
- Almaazmi AM, Alhammadi SH, Al Ali AA, Alzaabi NI, Kiklikian JM. Riding to the rescue: A comprehensive review of health and safety measures in ambulance cars. Int. J. Occup. Saf. Health. 2024;14(2):282-93. DOI
- Subbarao B. Knowledge Management in Road Accident Detection Based on Developed Deep Learning. Adv. Eng. Intell. Syst. 2023;2(04):44-67.
- Sherimon V, PC S, Ismaeel A, Babu A, Wilson SR, Abraham S, Joy J. An Overview of Different Deep Learning Techniques Used in Road Accident Detection. Int. J. Adv. Comput. Sci. Appl. 2023;14(11). DOI
- Nassar DH, Al-Tuwaijari JM. A Review of Vehicle Accident Detection and Notification Systems Based on Machine Learning Techniques. Acad. Sci. J. 2024;2(2):105-26. DOI
- Ekatpure R. Challenges and opportunities in the deployment of fully autonomous vehicles in urban environments in developing countries. J. Sustain. Technol. Infrastruct. Dev. Ctries. 2023;6:72-91.
- Tsai CY, Su YK. MobileNet-JDE: a lightweight multi-object tracking model for embedded systems. Multimed. Tools Appl. 2022;81(7):9915-37. DOI
- Charan Kumar C. Accident Detection From CCTV Footage. Kaggle. 2020. DOI
- Zhu L, Zhang Q, Jian X, Yang Y. Graph convolutional network for traffic incidents duration classification. Eng. Appl. Artif. Intell. 2025;151:110570. DOI
- Singh J, Singh G, Singh P, Kaur M. Evaluation and classification of road accidents using machine learning techniques. In Emerging Research in Computing, Information, Communication and Applications: ERCICA 2018. Springer Singapore. 2019;1:193-204. DOI
- Sugetha C, Karunya L, Prabhavathi E, Sujatha PK. Performance evaluation of classifiers for analysis of road accidents. In 2017 Ninth International Conference on Advanced Computing (ICoAC). IEEE. 2017:365-8. DOI
Cite this article:
Eskief, M. Z., Al Issa, N., Shouki, N., Chenghui, M. An improved MobileNetV2 architecture for efficient roadside accident detection in CCTV footage. DYSONA – Applied Science, 2025;6(2): 445-451. doi: 10.30493/das.2025.518818