The parsing module parses the caption into a constituency parse tree.
The visual recognizer extracts visual features from images.
The concreteness scores are estimated in the joint visual-semantic embedding space and are then passed back to train the parser.
The parser does not need a paired image to parse a sentence.