OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields
1. Background: Top-Down vs. Bottom-Up Approaches
- Top-Down:
- Process: Run a person detector first, then estimate the pose of each detected person individually.
- Issues:
- Early commitment: Failure in person detection → no pose recovery.
- Scalability: Runtime grows roughly in proportion to the number of people, because the pose estimator runs once per detection (e.g., 10 people ≈ 10x the single-person cost).
- Bottom-Up:
- Process: Detect all body parts first (e.g., elbows, wrists), then group them into people.
- Theoretical Advantage: Part detection runs once per image, so its cost is independent of the number of people.
- Practical Challenge: Grouping parts into people is an NP-Hard K-dimensional matching problem; no efficient exact solution is known, and brute-force search grows combinatorially with the number of people.
2. OpenPose’s Core Innovation: Part Affinity Fields (PAFs)
PAFs solve the grouping problem by encoding both location and orientation of limbs:
- Structure:
- Each limb type (e.g., left forearm) has a dedicated 2D vector field.
- At pixels that lie on a limb, the vector points from the parent joint to the child joint (e.g., shoulder → elbow); elsewhere the field is zero.
- Key Insight:
- Instead of solving the NP-Hard problem directly, use directional affinity scores to greedily associate joints.
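To make the idea concrete, here is a minimal sketch of a PAF for a single limb and of a directional affinity score between two joint candidates. It uses plain Python lists to stay dependency-free. The function names, the `limb_width` threshold, and the `samples` count are illustrative assumptions, not values taken from OpenPose.

```python
import math

def limb_paf(height, width, parent, child, limb_width=1.0):
    """Build one limb's field: unit vector parent->child on limb pixels, zero elsewhere.

    parent, child: (x, y) joint positions. limb_width is a hypothetical
    distance threshold (in pixels) for "lying on the limb".
    """
    (x1, y1), (x2, y2) = parent, child
    dx, dy = x2 - x1, y2 - y1
    length = math.hypot(dx, dy)
    vx = [[0.0] * width for _ in range(height)]
    vy = [[0.0] * width for _ in range(height)]
    if length == 0:
        return vx, vy
    ux, uy = dx / length, dy / length
    for y in range(height):
        for x in range(width):
            rx, ry = x - x1, y - y1
            along = rx * ux + ry * uy        # position along the limb axis
            across = abs(rx * uy - ry * ux)  # perpendicular distance to the axis
            if 0 <= along <= length and across <= limb_width:
                vx[y][x], vy[y][x] = ux, uy
    return vx, vy

def affinity_score(vx, vy, a, b, samples=10):
    """Average alignment of the field with the segment a->b (requires samples >= 2)."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    length = math.hypot(dx, dy)
    if length == 0:
        return 0.0
    ux, uy = dx / length, dy / length
    total = 0.0
    for i in range(samples):
        t = i / (samples - 1)
        x = int(round(a[0] + t * dx))
        y = int(round(a[1] + t * dy))
        total += vx[y][x] * ux + vy[y][x] * uy  # dot product with the candidate direction
    return total / samples
```

A candidate pair that follows the limb scores close to 1, a pair crossing it scores close to 0, and a pair pointing the wrong way scores negative. This signal is what allows a greedy matcher to pick connections without solving the full matching problem.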
3. Architecture
- Backbone: Truncated VGG-19 (first 10 layers) for feature extraction.
- Two Parallel Branches:
- Confidence Maps Branch: Predicts one heatmap per joint type (e.g., 18 maps for 18 joints).
- PAFs Branch: Predicts one 2D vector field per limb type (e.g., 19 limb types × 2 vector components (x, y) = 38 output channels).
- Multi-Stage Refinement:
- 6 Stages: The first stage predicts from the backbone features alone; each later stage takes the features together with the previous stage's confidence maps and PAFs as inputs and refines them.
- Intermediate Supervision:
- An L2 loss is applied at the end of each stage to both branches (Eq. 3–4 of the paper).
- Mitigates vanishing gradients by "replenishing" gradients periodically.
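The intermediate supervision described above can be sketched as a sum of per-stage L2 losses over both branches. This is a simplified illustration, not a reproduction of Eq. 3–4: maps are flattened to plain lists, any per-pixel weighting is left out, and the names `squared_error` and `total_loss` are hypothetical.

```python
def squared_error(pred, target):
    """Sum of squared differences between two equally sized flat lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))

def total_loss(stage_outputs, target_conf, target_paf):
    """Add the L2 loss of both branches at every stage.

    stage_outputs: list of (conf_pred, paf_pred) pairs, one per stage.
    Because every stage contributes its own term, each stage receives a
    direct gradient signal instead of relying only on the last stage.
    """
    loss = 0.0
    for conf_pred, paf_pred in stage_outputs:
        loss += squared_error(conf_pred, target_conf)  # confidence-map branch
        loss += squared_error(paf_pred, target_paf)    # PAF branch
    return loss
```

For example, with two stages where the second prediction is closer to the target, the second stage contributes a smaller term, but both terms remain in the sum.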
4. Greedy Parsing with PAFs
1. Detect Joint Candidates
- Non-Maximum Suppression (NMS):
- Keep only the local maxima of each confidence map as joint candidates, discarding the weaker responses around each peak.
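A minimal sketch of this step on one confidence map, assuming a 3×3 neighbourhood and a hypothetical score threshold of 0.1 (neither value is taken from the source). A pixel is kept only if it is above the threshold and strictly greater than all of its neighbours, so flat plateaus are ignored in this simplified version.

```python
def find_peaks(heatmap, threshold=0.1):
    """Return (x, y, score) for every strict local maximum above threshold.

    heatmap: list of rows, each a list of floats (one joint type's map).
    """
    height, width = len(heatmap), len(heatmap[0])
    peaks = []
    for y in range(height):
        for x in range(width):
            score = heatmap[y][x]
            if score <= threshold:
                continue  # too weak to be a joint candidate
            # Collect the up-to-8 neighbours, clipping at the image border.
            neighbours = [
                heatmap[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            ]
            if all(score > n for n in neighbours):
                peaks.append((x, y, score))
    return peaks
```

Running this once per joint type yields the candidate lists that the limb-scoring step then pairs up.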
2. Score Possible Limb Connections
For each limb type (e.g., neck-to-hip):