Visually Grounded Neural Syntax Acquisition


Haoyue Shi1      Jiayuan Mao2      Kevin Gimpel1      Karen Livescu1     
1: Toyota Technological Institute at Chicago, Chicago, IL, USA.
2: ITCS, Institute for Interdisciplinary Information Sciences, Tsinghua University, China.

Paper/Code/Bib


Abstract

We present the Visually Grounded Neural Syntax Learner (VG-NSL), an approach for learning syntactic representations and structures without explicit supervision. The model learns by looking at natural images and reading paired captions. VG-NSL generates constituency parse trees of texts, recursively composes representations for constituents, and matches them with images. We define the concreteness of constituents by their matching scores with images, and use it to guide the parsing of text. Experiments on the MSCOCO dataset show that VG-NSL outperforms various unsupervised parsing approaches that do not use visual grounding, in terms of F1 scores against gold parse trees. We find that VG-NSL is much more stable than these approaches with respect to the choice of random initialization and the amount of training data. We also find that the concreteness acquired by VG-NSL correlates well with a similar measure defined by linguists. Finally, we apply VG-NSL to multiple languages in the Multi30K dataset, showing that our model consistently outperforms prior unsupervised approaches.
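The abstract defines the concreteness of a constituent through its matching score with the paired image in a joint visual-semantic embedding space. The sketch below illustrates one way such a score could be computed, assuming cosine similarity as the matching function and a hinge-style contrast against mismatched images; the function names and the margin value are illustrative assumptions, not the exact definition used in the paper.

```python
import torch
import torch.nn.functional as F

def matching_score(constituent_emb, image_emb):
    # Cosine similarity between a constituent embedding and an image
    # embedding in the joint visual-semantic space (both 1-D tensors).
    return F.cosine_similarity(constituent_emb, image_emb, dim=0)

def concreteness(constituent_emb, paired_image_emb, contrastive_image_embs, margin=0.2):
    # Illustrative concreteness estimate: how much better the constituent
    # matches its paired image than a set of mismatched (contrastive)
    # images, accumulated with a hinge margin. The margin value and the
    # exact form of the hinge are assumptions, not the paper's definition.
    pos = matching_score(constituent_emb, paired_image_emb)
    hinges = [torch.clamp(pos - matching_score(constituent_emb, neg) - margin, min=0.0)
              for neg in contrastive_image_embs]
    return torch.stack(hinges).mean()
```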

Overview


Training

The parsing module parses the caption into a constituency parse tree.
The visual recognizer extracts visual features from the paired image.
The concreteness scores of the constituents are estimated in the joint visual-semantic embedding space and passed back to train the parser, as sketched below.
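Read together, these steps form a training loop like the one sketched below. The `ParsingModule`, `VisualRecognizer`, `sample_contrastive_images`, and `data_loader` names are hypothetical stand-ins for the corresponding components, and the sketch reuses the concreteness function from the sketch above; the actual VG-NSL training objective is a policy-gradient-style update of which this is only a simplified approximation.

```python
import torch

# Hypothetical stand-ins: `parsing_module` maps a caption to a parse tree,
# constituent embeddings, and log-probabilities of its parsing decisions;
# `visual_recognizer` maps an image to a vector in the joint embedding space.
parsing_module = ParsingModule()
visual_recognizer = VisualRecognizer()
optimizer = torch.optim.Adam(parsing_module.parameters(), lr=1e-4)

for image, caption in data_loader:  # paired image-caption data
    image_emb = visual_recognizer(image)                          # visual features
    tree, constituent_embs, log_probs = parsing_module(caption)   # parse + compose

    # Estimate the concreteness of each constituent in the joint space,
    # contrasting the paired image against sampled mismatched images.
    rewards = torch.stack([
        concreteness(c, image_emb, sample_contrastive_images())
        for c in constituent_embs
    ])

    # Pass the scores back to the parser: reinforce parsing decisions that
    # produced constituents with high concreteness (assumes one parsing
    # decision per constituent so that rewards and log_probs align).
    loss = -(rewards.detach() * log_probs).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```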

Inference

At inference time, the parser does not require a paired image; it parses a sentence from text alone.
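A minimal usage sketch, reusing the hypothetical `parsing_module` from the training sketch above:

```python
# No image or visual recognizer is involved at inference time:
# the trained parsing module is applied to the raw caption text.
caption = "a brown dog runs across the grass"
tree, _, _ = parsing_module(caption)   # constituency parse tree of the caption
print(tree)
```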

Related Resources