Detect People from CCTV Footage Using YOLOv3

Photo by Jacek Dylag on Unsplash


Introduction

In this blog, we are going to see how YOLOv3, a pre-trained model, can be used to detect humans in CCTV footage. To follow along, you need some prior knowledge of Python, because the whole prediction is done in Python. The process involves extracting frames from the CCTV footage; I have already covered that in my previous blog on frame extraction, so check it out to learn more about frame manipulation using OpenCV.

Prerequisites

  • Prior knowledge of Python concepts like loops and conditional statements.

  • OpenCV, a computer-vision library to handle video frames.

  • The NumPy library to manipulate the frames as arrays.

Understanding the YOLOv3 model

YOLOv3 (You Only Look Once, version 3) is a deep learning model used for real-time object detection in images and videos. The model divides an image into a grid and, for each grid cell, predicts bounding boxes, object classes, and confidence scores.

You need some background in deep learning and neural networks to understand this model in depth. If you are only concerned with the process, you don't need that background, because YOLOv3 is pre-trained; we just load its weights and cfg files. See neural network basics for a refresher.
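
Even though we treat YOLOv3 as a black box, it helps to know what its raw output looks like, because we slice into it later in the code. Here is a minimal sketch, assuming the standard COCO-trained model with 80 classes (the variable names are my own; only the layout comes from the model):

    # Each detection row produced by YOLOv3 (COCO) holds 85 values:
    # [center_x, center_y, width, height, objectness, 80 class scores]
    import numpy as np

    detection = np.zeros(85)      # one dummy detection row
    box = detection[:4]           # box center and size, relative to the image
    objectness = detection[4]     # how likely the box contains any object
    class_scores = detection[5:]  # per-class scores; index 0 is 'person' in COCO
    class_id = int(np.argmax(class_scores))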

Coding part

Let me give you a heads-up before starting the coding part: I have already extracted frames from the footage. Check this blog (from the same author) on extracting frames from a video, or see the sketch below.
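
In case you skipped that post, here is a minimal frame-extraction sketch, assuming a hypothetical input video cctv.mp4 and a hypothetical output folder frames/ (I sampled roughly two frames per second, which gives about 71 frames for 35 seconds of footage):

    import os
    import cv2

    os.makedirs('frames', exist_ok=True)  # hypothetical output folder
    cap = cv2.VideoCapture('cctv.mp4')    # hypothetical input video
    fps = cap.get(cv2.CAP_PROP_FPS)
    step = max(int(fps // 2), 1)          # keep about two frames per second

    saved, read = 0, 0
    while True:
        ret, frame = cap.read()
        if not ret:                       # end of the video
            break
        if read % step == 0:
            saved += 1
            cv2.imwrite(f'frames/frame_{saved}.jpg', frame)
        read += 1
    cap.release()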

  • Load weights and cfg files.

  • Extract the frames and save them to the specified location.

  • Initialize variables for counting the people in each frame.

      import cv2
      import numpy as np

      # Load the pre-trained YOLOv3 network from its weights and config files
      net = cv2.dnn.readNet("yolov3.weights", "yolov3.cfg")
      highest_people_count = 0
      highest_frame_number = -1
      lowest_people_count = float('inf')
      lowest_frame_number = -1
    
  • Loop through each frame; I extracted 71 frames from my 35-second footage.

for i in range(1, 72):  # 71 frames extracted from the footage
    # Load the i-th extracted frame (placeholder path; point it at your own frames)
    frame_path = f'extracted_frame_path/frame_{i}.jpg'
    image = cv2.imread(frame_path)
    height, width = image.shape[:2]

The above loop is the outermost loop; all of the following code is placed inside this for loop.

  • Preprocess the frames

      blob = cv2.dnn.blobFromImage(image, 0.00392, (416, 416), swapRB=True, crop=False)
      net.setInput(blob)
    

    blobFromImage resizes the image to 416x416 and multiplies every pixel by 0.00392 (about 1/255, scaling values into the 0-1 range); swapRB=True swaps the R and B channels, since OpenCV loads images as BGR while the model expects RGB.

      outs = net.forward(net.getUnconnectedOutLayersNames())
    

    net.getUnconnectedOutLayersNames() retrieves the names of the output layers that are not connected to any later layer, i.e. the network's final detection layers; net.forward() propagates the input through the network and returns the outputs of those layers.
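
    As a quick sanity check, here is roughly what those objects look like for a 416x416 input; the layer names below come from the standard yolov3.cfg, so treat them as an assumption if you use a custom config:

      print(blob.shape)                          # (1, 3, 416, 416): batch, channels, height, width
      print(net.getUnconnectedOutLayersNames())  # ('yolo_82', 'yolo_94', 'yolo_106') in standard YOLOv3
      for out in outs:
          print(out.shape)                       # (N, 85): N candidate boxes, 85 values per box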

  • Define threshold levels

      conf_threshold = 0.5
      nms_threshold = 0.4
    

    If a detection's confidence falls below conf_threshold, it is discarded; nms_threshold sets how much overlap two boxes may have before non-maximum suppression removes the weaker one.

  • Initialize lists to store the detection results.

      class_ids = []
      confidences = []
      boxes = []
    
  • Detection process

      for out in outs:
          for detection in out:
              scores = detection[5:]        # the 80 class scores
              class_id = np.argmax(scores)
              confidence = scores[class_id]

              if confidence > conf_threshold and class_id == 0:  # 0 corresponds to the 'person' class
                  # Convert relative box coordinates to pixel values
                  center_x = int(detection[0] * width)
                  center_y = int(detection[1] * height)
                  w = int(detection[2] * width)
                  h = int(detection[3] * height)
                  x = int(center_x - w / 2)
                  y = int(center_y - h / 2)

                  class_ids.append(class_id)
                  confidences.append(float(confidence))
                  boxes.append([x, y, w, h])
    

    Class ID 0 corresponds to the 'person' class in the COCO dataset that YOLOv3 was trained on, so every detection we keep here is a person, and we store its bounding box. I draw the kept boxes after non-maximum suppression, as shown in the sketch below.

  • Apply non-maximum suppression to filter out overlapping bounding boxes.

      indices = cv2.dnn.NMSBoxes(boxes, confidences, conf_threshold, nms_threshold)
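
    cv2.dnn.NMSBoxes returns the indices of the boxes that survive suppression; only those are counted and drawn. A minimal sketch of drawing the survivors (the colour and line thickness are my own choices):

      # Draw each surviving box on the frame
      for idx in np.array(indices).flatten():  # flatten() copes with OpenCV versions that nest the indices
          x, y, w, h = boxes[idx]
          cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)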
    

I conclude the process by updating, inside the loop, the highest and lowest detection counts, and printing the corresponding frames once the loop finishes.

    # Count the people that survived NMS and update the running extremes
    count = len(indices)
    if count > highest_people_count:
        highest_people_count = count
        highest_frame_number = i
    if count < lowest_people_count:
        lowest_people_count = count
        lowest_frame_number = i

# After the loop, report the frames with the most and fewest people
if highest_frame_number != -1:
    print(f"Frame {highest_frame_number} has the highest people count: {highest_people_count}")
if lowest_frame_number != -1:
    print(f"Frame {lowest_frame_number} has the lowest people count: {lowest_people_count}")

This is my output

Frame 3 has the highest people count: 5
Frame 14 has the lowest people count: 2

Conclusion

This may look simple, but many concepts come together here. Try changing every single parameter, and before jumping into the code, get familiar with basic deep learning concepts and neural network architecture. I have tried to cover just the basics, as I am also new to this field; I hope to cover more in my upcoming blogs.