Many existing algorithms in computer vision use the pixel-grid as the
underlying representation. For example, stochastic models of images, such as
Markov random fields, are often defined on this regular grid. Or, face
detection is typically done by matching stored templates to every fixed-size
(say, 50x50) window in the image.
The pixel-grid, however, is not a natural representation of visual scenes. It
is rather an "artifact" of a digital imaging process.
It would be more natural, and presumably more efficient, to work with
perceptually meaningful entities obtained from a low-level grouping process.
For example, we can apply the Normalized Cuts
algorithm to partition an image into, say, 500 segments (what we call superpixels).
Such a superpixel map has many desired properties:
- It is computationally efficient: it reduces the complexity of
images from hundreds of thousands of pixels to only a few hundred superpixels.
- It is also representationally efficient: pairwise constraints
between units, while only for adjacent pixels on the pixel-grid, can now model
much longer-range interactions between superpixels.
- The superpixels are percetually meaningful: each superpixel is a
perceptually consistent unit, i.e. all pixels in a superpixel are most likely
uniform in, say, color and texture.
It is near-complete:
because superpixels are results of an oversegmentation,
most structures in the image are conserved. There is very
little loss in moving from the pixel-grid to the superpixel map.
It is actually not novel to use superpixels or atomic regions to speed up
later-stage visual processing; the idea has been around the community for a
while. What we have done is: (1) to empirically validate the completeness of
superpixel maps; and (2) to apply it to solve challenging vision problems such
as finding people in static images.
Superpixels from the Normalized Cuts
The Normalized Cuts
is a classical region segmentation algrithm developed at Berkeley, which uses
spectral clustering to exploit pairwise brightness, color and texture
affinities between pixels. We apply the Normalized Cuts to oversegment images
to obtain superpixels. In our experiments, to enforce locality we use only
local connections in the pairwise affinity matrix.
Figure 1: an example of superpixel maps. (a) is the original image; (b) is a human marked
segmentation; (c) is a superpixel map with k=200; (d) shows a reconstruction
of the human segmentation from the superpixels: we assign each superpixel to a
segment in (b) with the maximum overlapping area and extract the superpixel
Figure 1 shows an example of superpixels. If we compare the human-marked
segmentation (the groundtruth) to the one reconstructed from the superpixel map, we may find that
some contour details are lost in the process of oversegmentation (such as at
the upper-left corner). However most structures are conserved; and the
reconstructed segmentation is qualitatively very similar to the groundtruth.
To empirically validate the completeness of superpixels maps, we use a
boundary-based strategy: each human-marked segmentation defines a boundary map
( Figure 1(b) ); so does each superpixel map ( Figure 1(c) ). We can try to
match the groundtruth boundaries to the superpixel boundaries: if a groundtruth
boundary pixel is within a distance threshold (say, 2 pixels) of a superpixel
boundary pixel, we count it as a hit. The overall recall rate, percentage of
groundtruth boundary pixels being matched to superpixel boundaries, is a
measure of how much structure is being conserved.
For images of size (240x160) in the BSDS, the figure above shows the recall
rates of superpixel maps from k=50 to 400. Overall the recall rates are quite
good (especially if taken into consideration the complexities of the BSDS
images), indicating that a state-of-the-art segmentation algorithm is able to
produce superpixel maps with little loss of perceptually important structure. In
particular, at the threshold of 2 pixels, the recall rate is about 90% for 200
superpixels. This is sufficient for our work in learning
discriminative models of segmentation and finding
people in static images.
Greg Mori has released a version of our superpixel code in
matlab. Alyosha Efros has used other region segmentation algorithms in his recent work using
superpixels. Alternatively, we have developed a boundary-oriented superpixel
algorithm, the CDT graphs, which is scale-invariant
(and very fast).
- Learning a Classification Model for Segmentation.
Xiaofeng Ren and Jitendra Malik, in ICCV '03, volume 1, pages 10-17, Nice 2003.
- Recovering Human Body Configurations: Combining Segmentation and Recognition.
Greg Mori, Xiaofeng Ren, Alyosha Efros and Jitendra Malik, in CVPR '04, volume 2, pages 326-333, Washington, DC 2004.