Filter Out Papers

DensePose: Dense Human Pose Estimation In The Wild

Key Points:

Dense Mapping vs. Key Points:
- Traditional key point detection focuses on identifying specific joints (e.g., elbows, knees) on the human body. Dense human pose estimation, on the other hand, aims to map every pixel of the human body to a 3D surface model, providing a much more detailed representation.
Real-Time Performance:
- The system they developed can deliver highly accurate results in real time, making it suitable for applications that require fast and detailed human pose estimation.
They provide a dataset covering 50K annotated people, which is useful, and the repository has over 25K stars. I am not sure, though, whether we need this level of detailed mapping.
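
To make the difference concrete, here is a minimal sketch of the two output shapes. The joint names, pixel values, and the `count_annotated` helper are made up for illustration and do not come from the DensePose code; the only assumption taken from DensePose is that each body pixel carries a part index plus (u, v) surface coordinates.

```python
# Sparse representation: one (x, y) image coordinate per named joint.
sparse_pose = {
    "left_elbow": (42, 87),
    "right_knee": (65, 140),
}

# Dense representation: every pixel inside the person mask carries a
# body-part index plus (u, v) coordinates on that part's surface.
# Illustrative 2x3 image; None marks background pixels.
dense_pose = [
    [None, (1, 0.20, 0.75), (1, 0.25, 0.74)],
    [None, (2, 0.10, 0.40), None],
]


def count_annotated(dense):
    """Count pixels that are mapped to the body surface."""
    return sum(1 for row in dense for px in row if px is not None)


print(len(sparse_pose), "keypoints vs", count_annotated(dense_pose), "surface-mapped pixels")
```

The dense output grows with the number of body pixels, which is the cost we would pay for the extra detail.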

RTMW: Real-Time Multi-Person 2D and 3D Whole-body Pose Estimation

RTMW-l achieves 70.2 mAP on the COCO-WholeBody benchmark, making it the first open-source model to exceed 70 mAP on this benchmark.

<aside> 📖

COCO-WholeBody Benchmark:
- The COCO-WholeBody benchmark is a dataset and evaluation framework that extends the COCO dataset to include annotations for the entire human body, including key points for the face, hands, and feet, in addition to the standard body key points.
- This benchmark is used to evaluate models on their ability to detect and estimate the full set of key points for the whole human body. </aside>
Outstanding Performance.
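
For reference, a small sketch of how the whole-body keypoints break down by group. The group sizes (17 body, 6 feet, 68 face, 21 per hand, 133 in total) are the benchmark's; the ordering of the groups and the helper name `index_ranges` are my assumptions and should be checked against the annotation files before relying on the indices.

```python
# Keypoint groups and their sizes; the order here is an assumed layout.
GROUP_SIZES = [
    ("body", 17),
    ("feet", 6),
    ("face", 68),
    ("left_hand", 21),
    ("right_hand", 21),
]


def index_ranges(groups):
    """Assign each group a contiguous index range, in the order given."""
    ranges, start = {}, 0
    for name, size in groups:
        ranges[name] = range(start, start + size)
        start += size
    return ranges


RANGES = index_ranges(GROUP_SIZES)
TOTAL = sum(size for _, size in GROUP_SIZES)

print(TOTAL, {name: (r.start, r.stop - 1) for name, r in RANGES.items()})
```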

Fast and Flexible Human Pose Estimation with HyperPose

In this paper, we introduce HyperPose, a novel flexible and high-performance pose estimation library. HyperPose provides expressive Python APIs that enable developers to easily customise pose estimation algorithms for their applications. It further provides a model inference engine highly optimised for real-time pose estimation. This engine can dynamically dispatch carefully designed pose estimation tasks to CPUs and GPUs, thus automatically achieving high utilisation of hardware resources irrespective of deployment environments.

Good enough for a proof of concept.

Kimera: from SLAM to Spatial Perception with 3D Dynamic Scene Graphs

Irrelevant.

ViPNAS: Efficient Video Pose Estimation via Neural Architecture Search

Example: In a video pose estimation task, the spatial level optimization might involve designing a convolutional neural network (CNN) that effectively extracts features from each frame. The temporal level optimization might involve designing a recurrent neural network (RNN) or using temporal convolutional layers to combine features from multiple frames, ensuring that the pose estimation is accurate and consistent over time.

By optimizing both spatial and temporal aspects, the proposed ViPNAS method aims to achieve a better trade-off between accuracy and efficiency, enabling fast and accurate online video pose estimation.

We could learn something about the trade-off between accuracy and efficiency.
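
A toy sketch of the spatial/temporal split described above. This is not the ViPNAS architecture (which is found by architecture search); it only shows the two levels as separate steps. The function names are hypothetical, the "spatial step" is a placeholder for a per-frame CNN, and the temporal step is a simple exponential moving average that I swapped in for a learned fusion module.

```python
def spatial_step(frame):
    """Stand-in for a per-frame CNN: returns one score per joint.

    Hypothetical: a "frame" here is already a list of per-joint scores.
    """
    return list(frame)


def temporal_step(current, previous, weight=0.7):
    """Blend current-frame features with the previously fused features."""
    if previous is None:
        return current
    return [weight * c + (1.0 - weight) * p for c, p in zip(current, previous)]


def run_online(frames, weight=0.7):
    """Process frames one by one, as an online video pipeline would."""
    fused, outputs = None, []
    for frame in frames:
        fused = temporal_step(spatial_step(frame), fused, weight)
        outputs.append(fused)
    return outputs


print(run_online([[1.0, 0.0], [0.0, 1.0]], weight=0.5))
```

A heavier spatial step on every frame costs accuracy-per-FLOP; reusing fused features from earlier frames is one way to spend less per frame, which is the trade-off of interest here.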

End-to-end Recovery of Human Shape and Pose

Produces a full 3D mesh rather than only detecting 2D keypoints, similar in goal to DensePose above but using a different technique.

Effective Whole-body Pose Estimation with Two-stages Distillation

Example:

Imagine you have a complex teacher model that has been trained for human pose estimation. This model is accurate but computationally expensive to run. You want to create a smaller, more efficient student model that can be deployed on devices with limited resources (e.g., mobile phones).
1. First Stage:
  - Train the student model using the teacher model's outputs. The student model learns from the teacher's intermediate features and final logits, with supervision on both visible and invisible keypoints.
2. Second Stage:
  - Use the student model's own predictions and features to further refine its performance. This self-distillation process helps the student model achieve higher accuracy.
The concept of Knowledge Distillation might be useful in this or other tasks.
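
A minimal sketch of what a first-stage distillation loss could look like, assuming a mean-squared-error term on intermediate features and a KL-divergence term on the final logits. The weights, the temperature, and all function names are my assumptions for illustration; the paper's exact loss may differ.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(student_feat, teacher_feat, student_logits, teacher_logits,
                      feat_weight=1.0, logit_weight=1.0, temperature=1.0):
    """Feature term pulls student features toward the teacher's;
    logit term pulls the student's output distribution toward the teacher's."""
    feat_term = mse(student_feat, teacher_feat)
    logit_term = kl_divergence(softmax(teacher_logits, temperature),
                               softmax(student_logits, temperature))
    return feat_weight * feat_term + logit_weight * logit_term


print(distillation_loss([0.1, 0.2], [0.3, 0.1], [2.0, 0.5], [3.0, 0.1]))
```

One reading of the second stage is that the same loss could be reused with a frozen copy of the trained student in the teacher role, but that is an assumption on my part.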

Human3.6m: Large scale datasets and predictive methods for 3D human sensing in natural environments

I guess these are synthesized datasets of random settings; we could explore their technique if we need to play around with our data.

Learning from Synthetic Humans

Synthetic data again.

OpenPifPaf: Composite Fields for Semantic Keypoint Detection and Spatio-Temporal Association
- Traditional methods might separate the tasks of detecting key points and tracking them into multiple stages. For example, first detecting key points in each frame and then linking them across frames.
- A single-stage framework performs both detection and tracking simultaneously in one unified process, which can be more efficient and faster.
Worth considering, especially if speed becomes an issue.
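
For comparison, this is roughly what the second stage of a traditional multi-stage pipeline looks like: key points are detected per frame first, then linked across frames in a separate step. The sketch below is a generic greedy nearest-neighbour linker, not OpenPifPaf code; `link_detections` and the `max_dist` threshold are hypothetical.

```python
import math


def link_detections(prev_points, curr_points, max_dist=20.0):
    """Greedily match each previous point to the nearest unused current point.

    Returns a list of (prev_index, curr_index) pairs; points farther apart
    than max_dist stay unmatched.
    """
    pairs, used = [], set()
    for i, (px, py) in enumerate(prev_points):
        best_j, best_d = None, max_dist
        for j, (cx, cy) in enumerate(curr_points):
            if j in used:
                continue
            d = math.hypot(cx - px, cy - py)
            if d <= best_d:
                best_j, best_d = j, d
        if best_j is not None:
            used.add(best_j)
            pairs.append((i, best_j))
    return pairs


frame_a = [(0, 0), (100, 100)]
frame_b = [(101, 99), (1, 1)]
print(link_detections(frame_a, frame_b))
```

A single-stage framework avoids this extra pass by predicting the associations together with the detections.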

V2V-PoseNet: Voxel-to-Voxel Prediction Network for Accurate 3D Hand and Human Pose Estimation from a Single Depth Map

This paper raises two concerns about using CNNs to recover 3D keypoints from 2D depth maps and offers solutions to them.

If we decide to use CNNs, we had better check the validity of the concerns covered here.
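
As the title suggests, the network works on voxels rather than on the 2D depth image. Below is a minimal sketch of how a depth map could be turned into an occupancy grid, assuming a pinhole camera. The intrinsics (`fx`, `fy`, `cx`, `cy`), the voxel size, and both function names are placeholders I chose; the paper's own preprocessing may differ.

```python
import math


def backproject(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) with depth to camera coordinates."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def voxelize(depth_map, fx, fy, cx, cy, voxel_size):
    """Return the set of occupied voxel indices; depth <= 0 means no measurement."""
    occupied = set()
    for v, row in enumerate(depth_map):
        for u, depth in enumerate(row):
            if depth <= 0:
                continue
            x, y, z = backproject(u, v, depth, fx, fy, cx, cy)
            occupied.add((
                math.floor(x / voxel_size),
                math.floor(y / voxel_size),
                math.floor(z / voxel_size),
            ))
    return occupied


toy_depth = [
    [0.0, 1.0],
    [1.0, 0.0],
]
print(sorted(voxelize(toy_depth, fx=1.0, fy=1.0, cx=0.0, cy=0.0, voxel_size=0.5)))
```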
