OpenPose Paper Summary

Top-Down:
1. Process: Detect people first (e.g., with a person detector), then estimate poses individually.
2. Issues:
  - Early commitment: Failure in person detection → no pose recovery.
  - Scalability: Runtime grows roughly in proportion to the number of people, because the single-person pose estimator runs once per detection (e.g., 10 people ≈ 10× the pose-estimation cost).
Bottom-Up:
1. Process: Detect all body parts first (e.g., elbows, wrists), then group them into people.
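2. Theoretical Advantage: Part detection runs once per image, so runtime can in principle be decoupled from the number of people.
3. Practical Challenge: Grouping parts into people is, in its general form, a K-dimensional matching problem that is NP-Hard, so exact solutions become intractable as the number of people grows.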

Part Affinity Fields (PAFs) address the grouping problem by encoding both the location and the orientation of limbs:

Structure:
- Each limb type (e.g., left forearm) has a dedicated 2D vector field.
- On pixels that lie on a limb, the field holds a unit vector pointing from the parent joint to the child joint (e.g., shoulder → elbow); elsewhere it is zero.
Key Insight:
- Instead of solving the NP-Hard problem directly, use directional affinity scores to greedily associate joints.
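
To make the PAF structure concrete, here is a minimal NumPy sketch that builds the field for a single limb of a single person. It is an illustration under stated assumptions, not the paper's code: the function name limb_paf, the limb_width parameter, and the example coordinates are all hypothetical, and the averaging across overlapping people is left out.

```python
import numpy as np

def limb_paf(parent, child, height, width, limb_width=1.0):
    """Return a (height, width, 2) field for one limb (illustrative sketch).

    Pixels within `limb_width` of the parent -> child segment hold the unit
    vector pointing from parent to child; all other pixels hold zeros.
    Joints are given as (x, y); the field is indexed as [y, x].
    """
    parent = np.asarray(parent, dtype=float)
    child = np.asarray(child, dtype=float)
    paf = np.zeros((height, width, 2))
    length = np.linalg.norm(child - parent)
    if length == 0:
        return paf  # degenerate limb: no direction to encode
    v = (child - parent) / length            # unit direction parent -> child
    ys, xs = np.mgrid[0:height, 0:width]
    dx = xs - parent[0]
    dy = ys - parent[1]
    along = dx * v[0] + dy * v[1]            # distance along the limb
    across = np.abs(dx * v[1] - dy * v[0])   # distance perpendicular to it
    on_limb = (along >= 0) & (along <= length) & (across <= limb_width)
    paf[on_limb] = v
    return paf

# Hypothetical horizontal limb from (x=2, y=5) to (x=8, y=5).
paf = limb_paf(parent=(2, 5), child=(8, 5), height=10, width=12)
print(paf[5, 5])   # a pixel on the limb
print(paf[0, 0])   # a pixel far from the limb
```

Storing a direction rather than a plain "limb is here" mask is what lets the later association step tell apart two candidate connections that cross the same pixels at different angles.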

Backbone: Truncated VGG-19 (first 10 layers) for feature extraction.
Two Parallel Branches:
- Confidence Maps Branch: Predicts heatmaps for joint locations (e.g., 18 maps for 18 joints).
- PAFs Branch: Predicts vector fields for limb orientations (e.g., 38 output channels = 19 limb types × 2 vector components, x and y).
Multi-Stage Refinement:
- 6 Stages: The first stage predicts from the backbone features alone; each later stage takes the features together with the previous stage's confidence maps and PAFs as inputs.
- Intermediate Supervision:
  - L_2 loss applied at each stage to both branches (Eq. 3–4).
  - Mitigates vanishing gradients by "replenishing" gradients periodically.
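
The intermediate supervision described above can be sketched as a sum of per-stage squared-error terms over both branches. This is a simplified illustration: the function names stage_loss and total_loss are hypothetical, the array shapes are toy values, and any per-pixel weighting of the loss is omitted.

```python
import numpy as np

def stage_loss(pred, target):
    """Squared L2 error summed over all pixels and channels of one branch."""
    return float(((pred - target) ** 2).sum())

def total_loss(stage_outputs, target_maps, target_pafs):
    """Add the confidence-map and PAF losses of every stage.

    `stage_outputs` is a list of (confidence_maps, pafs) pairs, one per stage,
    so each stage contributes its own gradient signal during training.
    """
    return sum(
        stage_loss(maps, target_maps) + stage_loss(pafs, target_pafs)
        for maps, pafs in stage_outputs
    )

# Toy example: 4x4 images, 18 confidence-map channels, 38 PAF channels.
target_maps = np.zeros((4, 4, 18))
target_pafs = np.zeros((4, 4, 38))
stage_outputs = [
    (np.ones((4, 4, 18)), np.zeros((4, 4, 38))),   # stage 1: maps are off by 1
    (np.zeros((4, 4, 18)), np.zeros((4, 4, 38))),  # stage 2: perfect prediction
]
loss = total_loss(stage_outputs, target_maps, target_pafs)
print(loss)
```

Because every stage has its own loss term, early layers receive gradient from nearby supervision rather than only from the final output.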

1. Detect Joint Candidates

Non-Maximum Suppression (NMS):
- Keep only the local maxima of each confidence map as joint candidates, discarding the weaker responses around each peak.
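
A minimal sketch of this peak-picking step on a single confidence map, using NumPy only. The function name find_peaks, the 4-neighbor comparison, and the 0.1 threshold are assumptions chosen for illustration; real implementations may smooth the map first or use a larger neighborhood.

```python
import numpy as np

def find_peaks(conf_map, threshold=0.1):
    """Return (x, y, score) for every local maximum above `threshold`.

    A pixel counts as a peak if it is at least as large as its up, down,
    left, and right neighbors. Flat plateaus can therefore yield several
    adjacent peaks; this sketch does not merge them.
    """
    conf_map = np.asarray(conf_map, dtype=float)
    # Pad with -inf so border pixels can be compared like interior ones.
    padded = np.pad(conf_map, 1, mode="constant", constant_values=-np.inf)
    center = padded[1:-1, 1:-1]
    is_peak = (
        (center > threshold)
        & (center >= padded[:-2, 1:-1])   # neighbor above
        & (center >= padded[2:, 1:-1])    # neighbor below
        & (center >= padded[1:-1, :-2])   # neighbor to the left
        & (center >= padded[1:-1, 2:])    # neighbor to the right
    )
    ys, xs = np.nonzero(is_peak)
    return [(int(x), int(y), float(conf_map[y, x])) for y, x in zip(ys, xs)]

# Toy 5x5 map with two clear peaks, one shoulder value, and one weak response.
conf = np.zeros((5, 5))
conf[1, 1] = 0.9    # peak
conf[1, 2] = 0.5    # next to the stronger 0.9, so suppressed
conf[3, 3] = 0.7    # peak
conf[0, 4] = 0.05   # below threshold
peaks = find_peaks(conf)
print(peaks)
```

Running this once per joint type yields the candidate sets that the limb-scoring step then tries to connect.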

2. Score Possible Limb Connections

For each limb type (e.g., neck-to-hip):