
Detection and feature extraction

Deep SORT starts with a per-frame object detector, typically a convolutional neural network (CNN) such as YOLO (You Only Look Once), which outputs a bounding box for each object in the frame. A second CNN, trained for re-identification, then maps the image crop inside each box to a fixed-length appearance descriptor (128 dimensions in the original paper). These descriptors summarize what each object looks like, and the tracker later compares them to decide whether a new detection belongs to an existing track.
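As a rough illustration of what the tracker receives from this stage, the sketch below pairs a bounding box with a unit-length descriptor. The `Detection` class, its field names, and the toy four-dimensional embedding are assumptions made for this example and are not taken from any particular Deep SORT implementation.

```python
from dataclasses import dataclass
import math


@dataclass
class Detection:
    """One detector output plus its appearance descriptor (hypothetical structure)."""
    box: tuple          # (x, y, width, height) in pixels, top-left corner
    confidence: float   # detector score in [0, 1]
    descriptor: list    # appearance embedding produced by the second CNN


def l2_normalize(vector):
    """Scale a vector to unit length so cosine similarity becomes a dot product."""
    norm = math.sqrt(sum(v * v for v in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [v / norm for v in vector]


# A toy 4-dimensional embedding stands in for the real network output.
det = Detection(
    box=(10.0, 20.0, 50.0, 100.0),
    confidence=0.9,
    descriptor=l2_normalize([3.0, 0.0, 4.0, 0.0]),
)
```

Normalizing the descriptor once, at detection time, keeps the later appearance comparison cheap: the cosine distance between two unit vectors is simply one minus their dot product.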

Kalman filter in state prediction

Similar to SORT, Deep SORT uses a Kalman filter for state prediction. The state of a tracked object comprises its bounding-box center, aspect ratio, and height, together with the velocity of each of these quantities; the motion model assumes constant velocity and does not estimate acceleration. In each new frame, the Kalman filter predicts every track's state from its last estimate, and a matched detection then corrects that prediction.
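The prediction step can be sketched as follows, assuming a constant-velocity model over an eight-dimensional state (box center, aspect ratio, height, and the velocity of each), as in the original Deep SORT formulation. The function names and the noise values are illustrative assumptions, and plain Python lists are used only to keep the example free of dependencies; a real tracker would normally use NumPy.

```python
def mat_mul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def constant_velocity_matrix(dim=4, dt=1.0):
    """State transition F: each position component advances by dt * velocity."""
    n = 2 * dim
    f = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(dim):
        f[i][dim + i] = dt
    return f


def predict(mean, covariance, process_noise, dt=1.0):
    """Kalman prediction: mean' = F mean, covariance' = F P F^T + Q."""
    n = len(mean)
    f = constant_velocity_matrix(n // 2, dt)
    new_mean = [sum(f[i][j] * mean[j] for j in range(n)) for i in range(n)]
    fpft = mat_mul(mat_mul(f, covariance), transpose(f))
    new_cov = [[fpft[i][j] + process_noise[i][j] for j in range(n)]
               for i in range(n)]
    return new_mean, new_cov


# State layout: (cx, cy, aspect, height, v_cx, v_cy, v_aspect, v_height).
mean = [100.0, 50.0, 0.5, 80.0, 2.0, -1.0, 0.0, 0.5]
identity = [[1.0 if i == j else 0.0 for j in range(8)] for i in range(8)]
noise = [[0.01 if i == j else 0.0 for j in range(8)] for i in range(8)]  # illustrative Q

new_mean, new_cov = predict(mean, identity, noise)
```

In this example the predicted center moves from (100, 50) to (102, 49) and the position variances grow, reflecting that the estimate becomes less certain until a detection corrects it.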

Data association with deep appearance metrics

Data association is where Deep SORT deviates most from its predecessor. Instead of relying solely on the intersection over union (IoU) of bounding boxes, it combines motion information with appearance features.

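For each track-detection pair, the cost combines the Mahalanobis distance between the Kalman-predicted state and the detection (motion consistency) with the cosine distance between appearance descriptors (appearance similarity). Pairs that exceed a threshold on either distance are excluded, and the Hungarian algorithm then finds the minimum-cost assignment over the resulting cost matrix. Because appearance stays informative when the motion prediction has become uncertain, this deep association metric lets Deep SORT recover an object's identity after a period of occlusion.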
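A minimal sketch of the combined cost and the matching step is shown below. Several simplifications are assumptions of this example: the track covariance is treated as diagonal, the weight `lam` and the appearance threshold `max_cosine` are illustrative values, the dictionary keys are hypothetical, and an exhaustive search over permutations stands in for the Hungarian algorithm (it returns the same minimum-cost assignment but is only practical for a handful of objects).

```python
from itertools import permutations

CHI2_GATE_4DOF = 9.4877  # 0.95 quantile of chi-square with 4 degrees of freedom
INFEASIBLE = 1e5         # cost assigned to pairs that fail either gate


def squared_mahalanobis_diag(measurement, mean, variances):
    """Squared Mahalanobis distance under a diagonal covariance (simplification)."""
    return sum((z - m) ** 2 / v for z, m, v in zip(measurement, mean, variances))


def cosine_distance(a, b):
    """Cosine distance between two unit-length descriptors."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def build_cost_matrix(tracks, detections, lam=0.5, max_cosine=0.3):
    """Rows are tracks, columns are detections; gated pairs get INFEASIBLE."""
    cost = []
    for t in tracks:
        row = []
        for d in detections:
            d_motion = squared_mahalanobis_diag(d["box"], t["mean"], t["var"])
            d_app = cosine_distance(d["feature"], t["feature"])
            admissible = d_motion <= CHI2_GATE_4DOF and d_app <= max_cosine
            row.append(lam * d_motion + (1 - lam) * d_app if admissible else INFEASIBLE)
        cost.append(row)
    return cost


def assign(cost):
    """Exhaustive minimum-cost assignment; needs len(cost) <= len(cost[0])."""
    if not cost or not cost[0]:
        return []
    n_tracks, n_dets = len(cost), len(cost[0])
    best = min(permutations(range(n_dets), n_tracks),
               key=lambda p: sum(cost[i][j] for i, j in enumerate(p)))
    # Drop pairs that were only matched because nothing admissible was left.
    return [(i, j) for i, j in enumerate(best) if cost[i][j] < INFEASIBLE]


# Boxes are (cx, cy, aspect, height); features are toy 2-D unit vectors.
tracks = [
    {"mean": (100.0, 50.0, 0.5, 80.0), "var": (25.0, 25.0, 0.01, 25.0), "feature": [1.0, 0.0]},
    {"mean": (300.0, 60.0, 0.5, 90.0), "var": (25.0, 25.0, 0.01, 25.0), "feature": [0.0, 1.0]},
]
detections = [
    {"box": (302.0, 61.0, 0.5, 91.0), "feature": [0.0, 1.0]},
    {"box": (101.0, 52.0, 0.5, 79.0), "feature": [1.0, 0.0]},
    {"box": (500.0, 200.0, 0.5, 60.0), "feature": [0.6, 0.8]},
]

cost = build_cost_matrix(tracks, detections)
matches = assign(cost)  # list of (track_index, detection_index) pairs
```

Here track 0 is matched to detection 1 and track 1 to detection 0, while detection 2 fails the motion gate for both tracks and stays unmatched; in a full tracker it would seed a new track. For realistic numbers of objects, a polynomial-time solver such as SciPy's `linear_sum_assignment` would replace the exhaustive search.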

Track management

Deep SORT manages track lifecycles using several heuristics. A new track starts as tentative and is confirmed only after it has been matched to detections in several consecutive frames (three in the original paper), which suppresses tracks spawned by false-positive detections. Each track also counts the frames since its last successful match; once that count exceeds a maximum age, the track is deleted, so the tracker maintains only active objects.
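These lifecycle rules amount to a small state machine, sketched below. The class, the method names, and the parameters `n_init` and `max_age` are assumptions chosen for this illustration, and the defaults are example values rather than recommended settings.

```python
from enum import Enum


class TrackState(Enum):
    TENTATIVE = 1
    CONFIRMED = 2
    DELETED = 3


class Track:
    """Lifecycle bookkeeping for one track (motion and appearance state omitted)."""

    def __init__(self, track_id, n_init=3, max_age=30):
        self.track_id = track_id
        self.n_init = n_init          # consecutive matches needed for confirmation
        self.max_age = max_age        # frames a confirmed track may go unmatched
        self.hits = 1                 # the detection that created the track
        self.time_since_update = 0
        self.state = TrackState.TENTATIVE

    def mark_hit(self):
        """Call when the track was matched to a detection in this frame."""
        self.hits += 1
        self.time_since_update = 0
        if self.state is TrackState.TENTATIVE and self.hits >= self.n_init:
            self.state = TrackState.CONFIRMED

    def mark_missed(self):
        """Call when no detection was matched to the track in this frame."""
        self.time_since_update += 1
        if self.state is TrackState.TENTATIVE:
            # A tentative track must be matched in every frame to survive.
            self.state = TrackState.DELETED
        elif self.time_since_update > self.max_age:
            self.state = TrackState.DELETED


# A track that is confirmed, then lost for longer than max_age.
track = Track(track_id=1, n_init=3, max_age=2)
track.mark_hit()
track.mark_hit()
state_after_hits = track.state
for _ in range(3):
    track.mark_missed()
state_after_misses = track.state

# A tentative track that misses its second frame is dropped immediately.
ghost = Track(track_id=2)
ghost.mark_missed()
```

Deleting tentative tracks on their first miss is what filters out spurious detections, while the larger `max_age` budget lets a confirmed track survive a brief occlusion before it is removed.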