Hierarchies allow feature sharing between objects at multiple levels
of representation, encode exponential variability in a compact way,
and enable fast inference. This makes them potentially suitable for
learning and recognizing a large number of object classes.
We developed a novel framework
for learning a hierarchical compositional shape vocabulary
for representing multiple object classes. The approach takes simple
contour fragments and learns their frequent spatial configurations.
These are recursively combined into increasingly complex and class-specific shape compositions. The lower layers are learned jointly on
images of all classes, whereas the higher layers of the vocabulary
are learned incrementally, by presenting the algorithm with one
object class after another. The experimental results show that the
learned multi-class object representation scales favorably with the
number of object classes.
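To make the learning step more concrete, the following is a minimal Python sketch of one layer of such vocabulary learning, under simplifying assumptions of our own: parts are given as discrete detections, and spatial relations are coarsely quantized offsets rather than the learned spatial distributions of the actual method.

\begin{verbatim}
from collections import Counter
from itertools import combinations

def learn_layer(images_parts, bin_size=8, top_k=50):
    """One vocabulary layer: count co-occurring part pairs, keyed by
    a quantized relative offset, and keep the most frequent spatial
    configurations as the compositions of the next layer."""
    counts = Counter()
    for parts in images_parts:  # per image: list of (part_id, x, y)
        for (p, px, py), (q, qx, qy) in combinations(parts, 2):
            offset = ((qx - px) // bin_size, (qy - py) // bin_size)
            counts[(p, q, offset)] += 1
    return [config for config, _ in counts.most_common(top_k)]

# Detections of the learned compositions then serve as the "parts"
# from which the next, more class-specific layer is learned.
\end{verbatim}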
We are interested in how semantic segmentation can help object detection. We propose a novel deformable part-based model that exploits segmentation algorithms which compute candidate object regions. Our approach allows every detection hypothesis to select a segment, and scores each box in the image using both the traditional HOG filters and a set of novel segmentation features. Our model thus ``blends'' between the detector and segmentation models. Our approach significantly outperforms the DPM and existing state-of-the-art approaches on the challenging PASCAL VOC 2010 dataset.
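The blended score of a detection hypothesis can be sketched as follows; this is a simplified illustration, with phi_hog and phi_seg standing in for the actual HOG and segmentation feature functions, which we do not reproduce here.

\begin{verbatim}
import numpy as np

def blended_score(box, segments, w_hog, w_seg, phi_hog, phi_seg):
    """Score a candidate box as the usual HOG filter response plus
    the best segmentation-feature response over candidate segments,
    i.e. each hypothesis selects the segment that supports it most."""
    hog_term = float(w_hog @ phi_hog(box))
    seg_term = max(float(w_seg @ phi_seg(box, s)) for s in segments)
    return hog_term + seg_term
\end{verbatim}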
We are interested in the problem of category-level 3D object detection. Given an image, our aim is to localize the objects in 3D by enclosing them with tight oriented 3D bounding boxes. For monocular images we extend the deformable part-based model to reason in 3D. We represent an object class as a deformable 3D cuboid composed of faces and parts, which are both allowed to deform with respect to their anchors on the 3D box. Inference then entails sliding and rotating the box in 3D and scoring object hypotheses. While we discretize the search space for inference, the variables in our model are continuous. For RGB-D data we propose a CRF model that reasons jointly about the class of each cuboid, the scene type, and object-object and object-scene interdependencies.
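Inference over the discretized pose space can be pictured as the exhaustive search below; the sketch is schematic, with score_fn standing in for the learned scoring of faces and deformable parts.

\begin{verbatim}
import numpy as np
from itertools import product

def infer_cuboid(score_fn, xs, ys, zs, num_yaw_bins=16):
    """Slide and rotate a 3D box over a discretized grid of positions
    and yaw angles, returning the best-scoring pose hypothesis."""
    yaws = np.linspace(0.0, 2.0 * np.pi, num_yaw_bins, endpoint=False)
    best_score, best_pose = float("-inf"), None
    for x, y, z, yaw in product(xs, ys, zs, yaws):
        s = score_fn(x, y, z, yaw)
        if s > best_score:
            best_score, best_pose = s, (x, y, z, yaw)
    return best_pose, best_score
\end{verbatim}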
We developed several approaches to holistic scene understanding from RGB and RGB-D imagery that reason jointly about multiple related tasks such as segmentation, detection, annotation and scene classification. We represent the problem with a Conditional Random Field with carefully designed potentials. We show that the joint model improves performance in all tasks. Furthermore, our model achieves state-of-the-art segmentation accuracy on the MSRC dataset while being an order of magnitude faster. For detection, our model significantly improves over the DPM.
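Schematically, and with notation that is ours rather than the exact potentials of the model, the joint model scores an assignment of segmentation labels $\mathbf{x}$, detections $\mathbf{y}$ and a scene class $s$ with an energy of the form
\[
E(\mathbf{x},\mathbf{y},s) \;=\; \sum_i \phi_{\mathrm{seg}}(x_i) \;+\; \sum_j \phi_{\mathrm{det}}(y_j) \;+\; \phi_{\mathrm{scene}}(s) \;+\; \sum_{i,j}\psi(x_i,y_j) \;+\; \sum_j \psi(y_j,s),
\]
where the pairwise terms couple the tasks so that evidence for one task, e.g. a confident detection, can correct errors in the others.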
We propose a holistic CRF model for semantic parsing that employs both text and image information as input.
We automatically parse the sentences, extract objects and their relationships, and incorporate them into the model,
both via potentials and by re-ranking candidate detections. We significantly improve over the vision-only
model.
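As an illustration of the re-ranking step, consider the hypothetical simplification below; the actual model additionally converts the extracted object relationships into CRF potentials.

\begin{verbatim}
def rerank_detections(detections, mentioned_classes, boost=0.5):
    """Boost candidate detections whose class is mentioned in the
    parsed sentence, then re-sort by the adjusted score."""
    for d in detections:  # each d: {"class": str, "score": float}
        if d["class"] in mentioned_classes:
            d["score"] += boost
    return sorted(detections, key=lambda d: d["score"], reverse=True)
\end{verbatim}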
For text generation, we developed a system that produces sentential
descriptions of video: who did what to whom,
and where and how they did it.
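As a toy illustration of this output structure (a hypothetical template, not the system's actual generation module):

\begin{verbatim}
def describe(who, did_what, to_whom, where=None, how=None):
    """Fill a (who, did-what, to-whom, where, how) tuple into a
    simple sentence template."""
    words = [who, did_what, to_whom]
    if how:
        words.append(how)
    if where:
        words.append(where)
    sentence = " ".join(words)
    return sentence[0].upper() + sentence[1:] + "."

# describe("a man", "opened", "the door",
#          where="in the kitchen", how="slowly")
# -> "A man opened the door slowly in the kitchen."
\end{verbatim}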