Anchor Points
YOLO-Pose: Enhancing YOLO for Multi Person Pose Estimation Using Object Keypoint Similarity Loss
This paper introduces YOLO-Pose, an approach that extends YOLO to multi-person pose estimation by integrating an Object Keypoint Similarity (OKS) loss. Built on the popular YOLO object detection framework, YOLO-Pose trains end to end and optimizes the OKS evaluation metric directly rather than a surrogate loss. It jointly performs person detection and 2D pose estimation without heatmap-style post-processing, and in a single forward pass, without test-time augmentation, it surpasses existing bottom-up methods on the COCO keypoint dataset.
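To make the OKS objective concrete, here is a minimal sketch of how Object Keypoint Similarity is computed between a predicted and a ground-truth pose. The function name and exact normalization are illustrative; the COCO evaluation uses fixed per-keypoint falloff constants and normalizes squared distances by object scale.

```python
import numpy as np

def oks(pred_kpts, gt_kpts, visibility, area, sigmas):
    """Object Keypoint Similarity between one predicted and one ground-truth pose.

    pred_kpts, gt_kpts: (K, 2) arrays of (x, y) keypoint coordinates.
    visibility: (K,) array; values > 0 mark labeled keypoints.
    area: scale of the ground-truth person (e.g., bounding-box area).
    sigmas: (K,) per-keypoint falloff constants (COCO defines these per joint).
    """
    # Squared Euclidean distance per keypoint
    d2 = np.sum((pred_kpts - gt_kpts) ** 2, axis=-1)
    # Per-keypoint similarity in [0, 1], normalized by object scale and joint type
    ks = np.exp(-d2 / (2.0 * area * sigmas ** 2 + np.spacing(1)))
    # Average only over labeled keypoints
    mask = visibility > 0
    return float(ks[mask].sum() / max(mask.sum(), 1))
```

Because every term is differentiable, this similarity can be turned into a loss (e.g., `1 - oks`) and minimized directly during training, which is the key idea the paper exploits.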
Introduction
Multi-person 2D pose estimation involves detecting individuals in an image and localizing their body joints. This task is challenging due to varying factors like the number of people, scale variations, occlusions, and the flexibility of the human body.
Key Points:
- Existing pose estimation methods are divided into top-down and bottom-up approaches.
- Top-down approaches run a person detector followed by single-person pose estimation on each detection, so their cost grows with the number of people in an image.
- Bottom-up methods employ heatmaps for keypoint detection and require post-processing steps like NMS, line integrals, and grouping.
- Post-processing in bottom-up approaches can be complex, as it must handle issues such as heatmap quantization error and spurious local maxima.
- Heatmaps also lack the spatial resolution to distinguish joints that lie very close together, and the non-differentiable post-processing prevents end-to-end training.
- Our motivation is to address pose estimation challenges without heatmaps, aligning with object detection principles that handle similar challenges.
- YOLO-Pose aims to integrate pose estimation with object detection strategies, leveraging advancements in the object detection field.
- By predicting human poses at multiple scales within a standard object detection framework, the approach directly inherits progress made in object detection.
- YOLO-Pose, based on YOLOv5, eliminates non-standard post-processing, demonstrating competitive accuracy on the COCO keypoint dataset.
- Each anchor stores a full 2D pose along with its bounding box location, which helps separate closely spaced joints belonging to different persons.
- Our method simplifies grouping by associating keypoints with anchors, reducing the need for additional post-processing.
- YOLO-Pose's complexity remains constant regardless of the number of individuals in an image, offering the advantages of both top-down and bottom-up approaches.
- Contributions include a unified approach to multi-person pose estimation and object detection, benefiting from advancements in object detection research.
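The anchor-based association described above can be sketched as follows. This is an illustrative decoding, not the paper's exact head layout: we assume each anchor's raw output packs box offsets, an objectness score, and per-keypoint (x, y, confidence) triplets, with keypoints regressed relative to the anchor center. Because every joint comes from the same anchor as its person box, no separate grouping step is needed.

```python
import numpy as np

def decode_anchor(pred, anchor_cx, anchor_cy, anchor_w, anchor_h, num_kpts=17):
    """Decode one anchor's raw prediction into a box and its grouped keypoints.

    Assumed layout: [dx, dy, dw, dh, obj, kx1, ky1, c1, ..., kxK, kyK, cK].
    Keypoints are regressed relative to the anchor center, so every joint
    is tied to the person box from the same anchor -- no grouping step.
    """
    # Box center offsets are scaled by anchor size; width/height use log-space
    cx = anchor_cx + pred[0] * anchor_w
    cy = anchor_cy + pred[1] * anchor_h
    w = anchor_w * np.exp(pred[2])
    h = anchor_h * np.exp(pred[3])
    # Keypoint offsets map to image coordinates via the same anchor
    kpts = pred[5:].reshape(num_kpts, 3).copy()
    kpts[:, 0] = anchor_cx + kpts[:, 0] * anchor_w
    kpts[:, 1] = anchor_cy + kpts[:, 1] * anchor_h
    return (cx, cy, w, h), kpts  # kpts[:, 2] holds per-keypoint confidences
```

Since each anchor carries a complete pose, standard detection NMS on the boxes simultaneously deduplicates the poses, which is why inference cost stays constant as the number of people grows.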
Related Work