HYBRIDIZED CONVOLUTIONAL LONG SHORT-TERM MEMORY AND SUPPORT VECTOR MACHINES MODEL FOR VIOLENCE DETECTION IN SURVEILLANCE FOOTAGE

Overview

Closed-circuit television (CCTV) surveillance cameras are widely used in both public and private settings to increase security. A Full High Definition (FHD) camera operating at thirty frames per second (30 fps) with moderate compression has a bitrate of about eight megabits per second (8 Mbps). At this bitrate, a single FHD camera recording for twenty-four hours produces approximately eighty-six gigabytes (86 GB) of data per day. Monitoring and analyzing all of this footage is challenging because of its sheer volume. Consequently, machine learning models have been used to automate the analysis of surveillance footage in order to detect violence. While these models have shown promising results, they continue to face challenges in processing speed and accuracy, particularly in the extraction of spatiotemporal features. This study developed a hybrid Convolutional Long Short-Term Memory and Support Vector Machine (Conv-LSTM-SVM) model for detecting violence in CCTV surveillance footage. Convolutional Neural Networks (CNNs) are a type of deep neural network designed to handle grid-structured data such as images. Long Short-Term Memory (LSTM) networks belong to the family of Recurrent Neural Networks (RNNs) and are designed for processing sequential data. Support Vector Machines (SVMs) are a supervised machine learning method used for tasks such as classification and regression. Integrating CNNs, LSTM networks, and SVMs combines the strengths of each: spatial feature extraction, temporal modelling, and robust classification. The model was developed, trained, and tested using the Keras library running on TensorFlow, following an experimental research design. The impact of various hyper-parameters on the performance of the hybridized model was investigated, and the results were used to optimize the model.
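
The daily storage figure quoted at the start of the abstract can be checked with a short calculation. The sketch below is illustrative only: the function name `daily_storage_gb` is hypothetical, and it assumes a constant bitrate and decimal units (1 GB = 1000 MB).

```python
def daily_storage_gb(bitrate_mbps: float, hours: float = 24.0) -> float:
    """Data volume in decimal gigabytes for a constant-bitrate video stream.

    Assumes a constant bitrate and decimal units (1 GB = 1000 MB).
    """
    seconds = hours * 3600            # recording duration in seconds
    megabits = bitrate_mbps * seconds  # total megabits produced
    megabytes = megabits / 8           # 8 bits per byte
    return megabytes / 1000            # megabytes to gigabytes


# One FHD camera at 8 Mbps recording for a full day
print(daily_storage_gb(8))  # 86.4
```

At 8 Mbps this gives 86.4 GB per camera per day, consistent with the roughly 86 GB stated above. The total grows linearly with the number of cameras.
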
The UCF-Crime dataset was used for model training, validation, and testing, while the RWF-2000 dataset was used for external validation. The training data was augmented so that the model was trained on the wide range of violent and non-violent activities it may encounter in real-world settings. The model's performance was evaluated, and a comparative table was used to compare the speed and recognition accuracy of the hybrid model against those of similar state-of-the-art models. With an accuracy of 97.8%, the Conv-LSTM-SVM model demonstrated its effectiveness in identifying violent action in surveillance footage, compared with 75%, 80%, and 97% for the LSTM, CNN, and Convolutional Long Short-Term Memory (Conv-LSTM) models respectively. Even though the Two-Stream Fusion CNN model demonstrated a marginally greater accuracy of 97.8%, the hybrid model demonstrated relatively higher computational efficiency, with a low inference time of 36 milliseconds and a training time of nine hours. Experimentation revealed that optimal regularization can be achieved with a dropout rate of 0.5, a learning rate of 0.001, and a batch size of 32. The Adam optimizer demonstrated the most rapid convergence, converging in 145 minutes. When tested on the unseen, heterogeneous RWF-2000 dataset, the model demonstrated cross-domain viability, achieving 91.3% detection accuracy without retraining. Its strong performance in accurately identifying violent behaviour makes the hybrid model a feasible tool for enhancing public safety and security in a range of surveillance scenarios.
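
To make the hybrid design concrete, the following is a minimal sketch of one way a Conv-LSTM feature extractor can feed an SVM classifier in Keras and scikit-learn. It is not the study's implementation. The layer sizes, clip dimensions, RBF kernel, two-stage training procedure, and random stand-in data are all assumptions made for illustration; only the dropout rate (0.5), learning rate (0.001), batch size (32), and Adam optimizer come from the abstract.

```python
import numpy as np
import tensorflow as tf
from sklearn.svm import SVC

# Illustrative clip dimensions (assumptions, not the study's settings).
FRAMES, HEIGHT, WIDTH, CHANNELS = 8, 32, 32, 3


def build_feature_extractor() -> tf.keras.Model:
    """Conv-LSTM network mapping a video clip to a fixed-length feature vector."""
    inputs = tf.keras.Input(shape=(FRAMES, HEIGHT, WIDTH, CHANNELS))
    # ConvLSTM2D learns spatial and temporal structure jointly.
    x = tf.keras.layers.ConvLSTM2D(
        16, kernel_size=3, padding="same", return_sequences=False
    )(inputs)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    x = tf.keras.layers.Dropout(0.5)(x)  # dropout rate reported in the abstract
    features = tf.keras.layers.Dense(32, activation="relu")(x)
    return tf.keras.Model(inputs, features)


def build_trainable_model(extractor: tf.keras.Model) -> tf.keras.Model:
    """Attach a temporary sigmoid head so the extractor can be trained end to end."""
    outputs = tf.keras.layers.Dense(1, activation="sigmoid")(extractor.output)
    model = tf.keras.Model(extractor.input, outputs)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )
    return model


# Random stand-in data: 16 clips, labels 0 (non-violent) and 1 (violent).
rng = np.random.default_rng(0)
clips = rng.random((16, FRAMES, HEIGHT, WIDTH, CHANNELS), dtype=np.float32)
labels = np.array([0, 1] * 8)

extractor = build_feature_extractor()
model = build_trainable_model(extractor)
model.fit(clips, labels, batch_size=32, epochs=1, verbose=0)

# Stage 2: the SVM replaces the sigmoid head as the final classifier.
features = extractor.predict(clips, verbose=0)
svm = SVC(kernel="rbf")  # kernel choice is an assumption
svm.fit(features, labels)
predictions = svm.predict(features)
```

In this arrangement the neural network is responsible only for learning spatiotemporal features, while the SVM draws the decision boundary between violent and non-violent clips. With real data, the clips would be fixed-length frame sequences sampled from the surveillance videos rather than random arrays.
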

