Using Segmentation to Estimate Human Body Pose from Bottom-up

    (under construction)


(a) (b) (c)

Figure 1: The goal of this work is to take an image such as the one in (a), detect a human figure, and localize its joints and limbs (b) along with their associated pixel masks (c).

Pixels, Superpixels, and Segmentations

(a) (b) (c) (d)

Figure 2: Stages of low-level processing. (a) Input image. (b) Edge map (Pb), which handles high-frequency texture well. (c) A Normalized Cuts segmentation with k=40: salient limbs pop out as single segments, while the head and torso consist of several segments. (d) A "superpixel" map with 200 superpixels, which preserves fine image detail.

Flow of the Algorithm

Learning Limb Detection

Each segment (Figure 2(c)) is a candidate for limb detection. We formulate limb detection as a classification problem that separates "good" limbs (taken from ground-truth human data) from "bad" limbs (random segments). The features we use include:

  1. contour cue: the low-level contrast between a segment and its surroundings.
  2. shape cue: the similarity between the shape of a segment and a template (a rectangle).
  3. shading cue: the similarity between the shading of a segment and a stored template (a limb often has a characteristic shading pattern).
  4. focus cue: the foreground is typically in focus and therefore has high-frequency content (a weak cue, present in many news photos).
A logistic classifier is used to combine these cues.
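
The cue combination above can be sketched as a logistic model over the four cue scores. This is a minimal illustration under stated assumptions, not the classifier used in this work: the function names, the learning rate, the number of epochs, and the toy cue values are all hypothetical, and the fitting procedure shown (plain stochastic gradient ascent on the log-likelihood) is just one common way to train such a model.

```python
import math

CUES = ("contour", "shape", "shading", "focus")

def sigmoid(z):
    """Logistic function mapping a real-valued score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def limb_probability(cues, weights, bias):
    """Combine the per-segment cue scores into a probability of being a limb."""
    z = bias + sum(weights[c] * cues[c] for c in CUES)
    return sigmoid(z)

def fit_logistic(examples, labels, lr=0.5, epochs=500):
    """Fit weights by stochastic gradient ascent on the log-likelihood.

    examples: list of dicts mapping cue name -> score (hypothetical values)
    labels:   1 for a "good" limb, 0 for a "bad" (random) segment
    """
    weights = {c: 0.0 for c in CUES}
    bias = 0.0
    for _ in range(epochs):
        for cues, y in zip(examples, labels):
            err = y - limb_probability(cues, weights, bias)
            for c in CUES:
                weights[c] += lr * err * cues[c]
            bias += lr * err
    return weights, bias

# Toy training data (made up for illustration only).
good = [
    {"contour": 0.9, "shape": 0.8, "shading": 0.7, "focus": 0.6},
    {"contour": 0.8, "shape": 0.9, "shading": 0.6, "focus": 0.5},
]
bad = [
    {"contour": 0.2, "shape": 0.3, "shading": 0.4, "focus": 0.5},
    {"contour": 0.3, "shape": 0.1, "shading": 0.2, "focus": 0.4},
]
weights, bias = fit_logistic(good + bad, [1, 1, 0, 0])
p_good = limb_probability(good[0], weights, bias)
p_bad = limb_probability(bad[0], weights, bias)
```

On this toy data the fitted model assigns a higher probability to the "good" segments than to the "bad" ones; with real data the weights would indicate how much each cue contributes.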

This figure shows the performance of our (half-)limb detector. On average, if we keep the top 8 candidates per image, we detect about 4 true half-limbs, which seems a good trade-off point.
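
The candidate selection described above amounts to keeping the highest-scoring segments and counting how many of them are true half-limbs. The sketch below assumes each segment already has a classifier score; the scores, the set of true half-limb indices, and the function names are hypothetical and chosen only to illustrate the bookkeeping.

```python
def top_k_candidates(scores, k=8):
    """Return the indices of the k highest-scoring segments, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]

def true_detections(kept, true_indices):
    """Return the kept candidates that are true half-limbs."""
    return sorted(set(kept) & set(true_indices))

# Hypothetical classifier scores for 12 segments in one image.
scores = [0.91, 0.15, 0.78, 0.40, 0.66, 0.05, 0.83, 0.52, 0.30, 0.71, 0.22, 0.60]
# Hypothetical ground truth: indices of segments that are real half-limbs.
true_half_limbs = [0, 2, 6, 8, 9, 10]

kept = top_k_candidates(scores, k=8)
found = true_detections(kept, true_half_limbs)
```

Varying k trades the number of true half-limbs recovered against the number of false candidates passed on to later stages.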

Assembly of Parts

In every baseball player image, a few half-limbs are very salient and easily detectable; we call these islands of saliency. On average we can detect 4 half-limbs among the top 8 candidates. We also have 5 candidates for the torso, obtained from a similar classifier. The goal of the assembly stage is to determine which of the top candidates are true half-limbs, label them, and use them to recover the whole-body configuration.

To label the half-limbs and the torso, we use the following global configuration cues:

  1. relative width: limbs may be foreshortened, but their widths remain unchanged. Therefore, the relative widths of any two parts, for example an upper leg and a lower arm, must be compatible with each other (we use anthropometric data as ground truth).
  2. length given torso: we assume that torsos are not foreshortened much. Therefore a torso candidate gives conservative upper bounds on the lengths of limbs.
  3. adjacency: parts that are connected in the body should also be adjacent to each other in the image.
  4. symmetry of clothing: symmetric limbs should have the same appearance.

We use exhaustive search with pruning to find the best configurations. Because we are able to limit the number of candidates (with the help of sophisticated low-level processing), the search space is reasonably small.
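
An exhaustive search with pruning over a small candidate set can be sketched as follows. This is an illustrative simplification, not the search used in this work: it covers only four half-limb labels, applies only the relative-width and length-given-torso cues (adjacency and clothing symmetry are omitted), and every constant (`EXPECTED_WIDTH`, `MAX_LENGTH_RATIO`, `WIDTH_TOLERANCE`) and every candidate value is a made-up placeholder rather than anthropometric data.

```python
LABELS = ("upper_arm", "lower_arm", "upper_leg", "lower_leg")

# Hypothetical constants, in arbitrary relative units.
EXPECTED_WIDTH = {"upper_arm": 1.0, "lower_arm": 0.8, "upper_leg": 1.6, "lower_leg": 1.1}
MAX_LENGTH_RATIO = {"upper_arm": 0.7, "lower_arm": 0.6, "upper_leg": 0.9, "lower_leg": 0.9}
WIDTH_TOLERANCE = 0.25  # allowed relative deviation of a width ratio

def best_assembly(candidates, torso_length):
    """Try every labeling of candidates, pruning inconsistent partial labelings."""
    best = {"score": float("-inf"), "assignment": {}}

    def consistent(label, cand, assignment):
        # Length cue: the torso gives an upper bound on limb length.
        if cand["length"] > MAX_LENGTH_RATIO[label] * torso_length:
            return False
        # Width cue: width ratios must match the expected ratios.
        for other_label, other_idx in assignment.items():
            if other_idx is None:
                continue
            expected = EXPECTED_WIDTH[label] / EXPECTED_WIDTH[other_label]
            observed = cand["width"] / candidates[other_idx]["width"]
            if abs(observed - expected) > WIDTH_TOLERANCE * expected:
                return False
        return True

    def recurse(i, assignment, score):
        if i == len(LABELS):
            if score > best["score"]:
                best["score"] = score
                best["assignment"] = dict(assignment)
            return
        label = LABELS[i]
        used = {idx for idx in assignment.values() if idx is not None}
        for idx, cand in enumerate(candidates):
            if idx in used or not consistent(label, cand, assignment):
                continue  # prune: this branch cannot yield a valid configuration
            assignment[label] = idx
            recurse(i + 1, assignment, score + cand["score"])
            del assignment[label]
        # Also allow the label to stay unassigned (the limb was not detected).
        assignment[label] = None
        recurse(i + 1, assignment, score)
        del assignment[label]

    recurse(0, {}, 0.0)
    return best["assignment"], best["score"]

# Hypothetical candidates: detector score, length, and width in pixels.
candidates = [
    {"score": 0.90, "length": 60, "width": 10},
    {"score": 0.80, "length": 50, "width": 8},
    {"score": 0.85, "length": 85, "width": 16},
    {"score": 0.70, "length": 80, "width": 11},
    {"score": 0.60, "length": 150, "width": 30},  # too long for any limb
]
assignment, score = best_assembly(candidates, torso_length=100)
```

Because each inconsistent partial labeling is discarded as soon as it is formed, the number of full configurations actually scored stays small when only a handful of candidates survive the detection stage.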

In more recent work (ICCV 2005), we have employed Integer Quadratic Programming (IQP) to solve the assembly problem.


  1. Recovering Human Body Configurations: Combining Segmentation and Recognition.
      Greg Mori, Xiaofeng Ren, Alyosha Efros and Jitendra Malik. In CVPR '04, volume 2, pages 326-333, Washington, DC, 2004.