Learning Discriminative Models for Image Segmentation


Region segmentation is a key problem in mid-level vision and has received a lot of attention in recent years. In this work, we try to answer the following question:

         What is a good segmentation?

Gestalt psychologists long ago revealed to us various grouping cues, such as similarity, continuity, etc. Computationally, however, we do not yet have a satisfying answer. Most existing segmentation algorithms, such as Normalized Cuts, use hand-constructed criteria and/or hand-picked parameters. It is our belief that a good answer to this question lies in the use of human-marked groundtruth data, on which segmentation algorithms can be rigorously trained and tested.

If we look closely at the classical Gestalt laws of grouping, we find that most of them are discriminative in nature. For example, when Wertheimer talked about his law of proximity, he presented a picture and asked which grouping is better, ab/cd/ef or a/bc/de/fg. Proximity tells us that the former is preferable.

This inspires us to develop a discriminative model of segmentation. Figure 1 shows an example, where "good" segmentations are given by human subjects, and "bad" segmentations are from random matching of images and masks. We will use Gestalt grouping cues as features and train a classifier to distinguish "good" from "bad".

       Figure 1: We formulate segmentation as classification between good segmentations (b), given by human subjects, and bad segmentations (c), obtained by random matching of images and masks.
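The construction of the two classes can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `make_training_pairs`, the string image identifiers, and the choice of one mismatched mask per image are all hypothetical and are not taken from the original work.

```python
import random

def make_training_pairs(image_ids, seed=0):
    """Return (image_id, mask_id, label) triples.

    label 1: the image paired with its own human-marked mask ("good").
    label 0: the image paired with a mask from a different image ("bad").
    """
    rng = random.Random(seed)
    pairs = []
    for img in image_ids:
        pairs.append((img, img, 1))
        # Assumption: one mismatched mask per image, chosen uniformly at random.
        other = rng.choice([m for m in image_ids if m != img])
        pairs.append((img, other, 0))
    return pairs

pairs = make_training_pairs(["img_a", "img_b", "img_c"])
```

Each image then contributes one positive and one negative example, so the two classes are balanced by construction.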

Cues for Grouping

We preprocess images into superpixel maps with 200 superpixels per image. Such a small number of superpixels means that we are almost free to define any feature we like, since we can afford to compute and search over them.

We consider a segment S within a segmentation, and define the following features for S:

  • inter-region contour energy, E_ext:
       how strong the contour contrast is along the boundary of a segment.
  • intra-region contour energy, E_int:
       how strong the contour contrast is in the interior of a segment.
  • inter-region brightness (dis)similarity, B_ext:
       how similar in brightness a segment is to surrounding segments.
  • intra-region brightness similarity, B_int:
       how consistent in brightness a segment is in its interior.
  • inter-region texture (dis)similarity, T_ext:
       how similar in texture a segment is to surrounding segments.
  • intra-region texture similarity, T_int:
       how consistent in texture a segment is in its interior.
  • curvilinear continuity, C:
       how smooth the boundary of a segment is.

How useful are these features? To quantify this, we can measure the mutual information between a feature f and the target binary label h (where h=1 if S is from a good segmentation, 0 otherwise):

Features       Contour Energy      Texture             Brightness          Continuity
               inter-    intra-    inter-    intra-    inter-    intra-
Mutual Info    0.387     0.012     0.137     0.030     0.112     0.049     0.198

We find that: (1) inter-region features are much more informative than intra-region features, suggesting that discriminative grouping cues are more useful than generative cues; (2) contour features are the most informative, followed by texture and brightness; and (3) continuity is quite a useful cue by itself. Of course, these are only marginal information measures; we will look at cue combination in the next section.
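As an illustration of the measurement, here is a minimal sketch of estimating the mutual information between one feature f and the binary label h. It assumes a simple equal-width histogram estimator and reports the result in bits; the estimator, the bin count, and the units are assumptions for this example, and the function name is hypothetical.

```python
import math
from collections import Counter

def mutual_information(values, labels, n_bins=10):
    """Estimate I(f; h) in bits by binning a continuous feature f."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant feature
    bins = [min(int((v - lo) / width), n_bins - 1) for v in values]

    n = len(values)
    joint = Counter(zip(bins, labels))  # counts of (bin, label) pairs
    bin_counts = Counter(bins)          # marginal counts of the binned feature
    label_counts = Counter(labels)      # marginal counts of the label

    mi = 0.0
    for (b, h), count in joint.items():
        p_joint = count / n
        p_indep = (bin_counts[b] / n) * (label_counts[h] / n)
        mi += p_joint * math.log2(p_joint / p_indep)
    return mi

# A feature that separates the classes perfectly vs. one that carries no information.
mi_separable = mutual_information([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])
mi_useless = mutual_information([0.1, 0.1, 0.9, 0.9], [0, 1, 0, 1])
```

With balanced classes the estimate is bounded above by 1 bit, which the perfectly separating toy feature attains, while the uninformative feature scores 0.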

Learning Cue Combination

How should these Gestalt cues be combined? First we want to develop some intuition about the interactions between the features. As we are dealing with a classification problem, one way to study the data is to plot both the positive and negative examples in the feature space, and look for a good classification boundary between them. Figure 2 shows some examples, where we look at pairs of features, and the distributions are shown as iso-probability contours.

These distributions suggest that:

  1. the features are relatively well-behaved; for both classes, a Gaussian model would be a reasonable approximation.
  2. a linear classifier would perform well.

Therefore we define our objective function, i.e. the "goodness" of a segment S, as a linear combination of the features:

G(S) = ∑_j c_j f_j

where f_j is the j-th feature of S and the weights c_j are learned by logistic regression. Quantitative evaluations show that such a linear combination captures most of the information available in these features.

Figure 2: iso-probability contour plots of empirical distributions for a pair of features.
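The linear "goodness" score and its training can be sketched as follows. This is a minimal illustration, not the implementation used in the work: it assumes plain batch gradient descent on the logistic log-loss, adds a bias term, and uses two made-up features per segment. The learning rate, epoch count, and toy data are hypothetical.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def goodness(weights, bias, features):
    """Linear score G(S) = sum_j c_j * f_j, plus a bias term (an assumption here)."""
    return bias + sum(c * f for c, f in zip(weights, features))

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit the weights by batch gradient descent on the logistic log-loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, label in zip(X, y):
            err = sigmoid(goodness(w, b, x)) - label
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Toy data: two made-up features per segment; label 1 = "good", 0 = "bad".
X = [[0.9, 0.2], [0.8, 0.1], [0.7, 0.3], [0.2, 0.8], [0.1, 0.9], [0.3, 0.7]]
y = [1, 1, 1, 0, 0, 0]
w, b = fit_logistic(X, y)
```

On this toy data the first feature receives a positive weight and the second a negative one, so "good" examples score above zero and "bad" examples below.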

For the "goodness" of an entire segmentation, we simply sum G(S) over all of its segments S (i.e. we assume the segments are independent). To search for good segmentations of a novel image, we use simulated annealing to optimize this linear objective function.
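The search step can be illustrated with a small simulated annealing loop over superpixel labels. This is a sketch under several assumptions: the single-superpixel relabeling move, the geometric cooling schedule, and the toy brightness-based score that stands in for the learned G(S) are all hypothetical choices, since the details of the search are not given here.

```python
import math
import random

def total_goodness(labels, segment_goodness):
    """Sum G(S) over all segments; a segment is the set of superpixels sharing a label."""
    segments = {}
    for idx, lab in enumerate(labels):
        segments.setdefault(lab, []).append(idx)
    return sum(segment_goodness(members) for members in segments.values())

def anneal(n_superpixels, n_labels, segment_goodness,
           t_start=1.0, t_end=0.01, steps=5000, seed=0):
    """Maximize the total goodness by relabeling one superpixel at a time."""
    rng = random.Random(seed)
    labels = [rng.randrange(n_labels) for _ in range(n_superpixels)]
    score = total_goodness(labels, segment_goodness)
    best_labels, best_score = labels[:], score
    for step in range(steps):
        # Geometric cooling from t_start to t_end (an assumed schedule).
        t = t_start * (t_end / t_start) ** (step / (steps - 1))
        i = rng.randrange(n_superpixels)
        old = labels[i]
        labels[i] = rng.randrange(n_labels)
        new_score = total_goodness(labels, segment_goodness)
        delta = new_score - score
        # Always accept improvements; accept a worse move with probability exp(delta / t).
        if delta >= 0 or rng.random() < math.exp(delta / t):
            score = new_score
            if score > best_score:
                best_labels, best_score = labels[:], score
        else:
            labels[i] = old  # reject the move
    return best_labels, best_score

# Toy stand-in for the learned G(S): reward segments with uniform brightness.
brightness = [0.1, 0.12, 0.11, 0.9, 0.88, 0.91]

def toy_goodness(members):
    vals = [brightness[i] for i in members]
    spread = max(vals) - min(vals)
    return len(members) * (0.2 - spread)  # hypothetical score, not the learned features

labels, score = anneal(len(brightness), n_labels=2, segment_goodness=toy_goodness)
```

Recomputing the full score after every move keeps the sketch short; with only 200 superpixels per image, a real implementation could instead update just the two segments affected by a move.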

Some Results

These results are obtained by combining the linear objective function above with a Gaussian prior on the size of segments (also estimated from groundtruth).
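How the prior enters the objective is not spelled out here, so the following is only one plausible sketch. It assumes the Gaussian is fit to raw segment sizes from groundtruth and that its log density is added to the linear score; the function names, the `prior_weight` parameter, and the sizes are hypothetical.

```python
import math
import statistics

def fit_size_prior(groundtruth_sizes):
    """Estimate the mean and standard deviation of segment size from groundtruth."""
    return statistics.mean(groundtruth_sizes), statistics.pstdev(groundtruth_sizes)

def log_size_prior(size, mu, sigma):
    """Log density of a Gaussian N(mu, sigma^2) evaluated at the segment size."""
    return -0.5 * ((size - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))

def score_with_prior(goodness, size, mu, sigma, prior_weight=1.0):
    # Assumption: the prior is added to the linear score in the log domain.
    return goodness + prior_weight * log_size_prior(size, mu, sigma)

mu, sigma = fit_size_prior([10, 12, 8, 11, 9])  # made-up sizes, in superpixels
typical = score_with_prior(1.0, size=10, mu=mu, sigma=sigma)
oversized = score_with_prior(1.0, size=20, mu=mu, sigma=sigma)
```

Under this assumption, two segments with the same linear score are ranked by how typical their size is, so the oversized segment scores lower.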


  1. Learning a Classification Model for Segmentation.
      Xiaofeng Ren and Jitendra Malik, in ICCV '03, volume 1, pages 10-17, Nice 2003.