Connected Papers: Real-time Human Pose Estimation

Multi-task Deep Learning for Real-Time 3D Human Pose Estimation and Action Recognition

Action recognition is a computer vision task that involves identifying and classifying the actions or activities being performed by one or more individuals in a video or sequence of images. This task typically involves analyzing the motion and posture of the human body to determine what action is being performed, such as walking, running, jumping, waving, or sitting.

No need for action recognition in our project.

VNect: Real-time 3D Human Pose Estimation with a Single RGB Camera

If your goal is to detect keypoints from a mobile video, you might not necessarily need a model that deals with 3D images. However, there are several advantages to using 3D pose estimation:

Depth Information: 3D pose estimation provides depth information, which can improve the accuracy of keypoint detection, especially in complex scenes with occlusions or varying distances.
Robustness: 3D models can be more robust to changes in camera angles and perspectives, as they capture the spatial relationships between keypoints in three dimensions.
Enhanced Applications: 3D pose estimation enables more advanced applications, such as augmented reality, motion capture, and detailed activity recognition, which require understanding the full spatial configuration of the body.

To provide 3D information for a model using a mobile camera, you can use several techniques:

Stereo Cameras:
- Use a mobile device with dual cameras (stereo cameras) to capture depth information. The disparity between the two camera images can be used to estimate depth and create a 3D representation.
Depth Sensors:
- Some mobile devices are equipped with depth sensors (e.g., LiDAR on newer iPhones and iPads) that can directly capture depth information.
Monocular Depth Estimation:
- Use deep learning models trained for monocular depth estimation to infer depth from a single camera image. These models can predict depth information based on learned patterns and context from the image.
  
  <aside> 📖
  
  Monocular depth estimation is a computer vision technique that predicts the depth information of a scene from a single image (monocular image). Unlike stereo vision, which requires two images from different viewpoints to estimate depth, monocular depth estimation relies on a single image and uses machine learning models to infer depth based on learned patterns and context.
  
  </aside>
Augmented Reality (AR) Frameworks:
- Utilize AR frameworks like ARKit (iOS) or ARCore (Android) that provide APIs for real-time 3D tracking and depth estimation using the device's camera and sensors.

Reconstructing the 3D image seems reasonable for the discussed reasons and it seems like their are enough tools for that with different levels of “developer involvement”.

Real-time 2D Multi-Person Pose Estimation on CPU: Lightweight OpenPose

Edge devices are computing devices that are located at the edge of a network, close to the source of data generation. These devices perform data processing and analysis locally, rather than sending all the data to a centralized cloud or data center. Edge devices are used in edge computing, which aims to reduce latency, improve response times, and decrease bandwidth usage by processing data near its source.

This paper/repo has over 2K stars so it seems dependable and its also lightweight which means it could be deployed seamlessly on mobiles.