Visually grounded language understanding and generation

Lu, Jiasen

Files

LU-DISSERTATION-2020.pdf (43.98 MB)

Author(s)

Lu, Jiasen

Advisor(s)

Parikh, Devi
Batra, Dhruv
Corso, Jason J.
Riedl, Mark O.
Hoffman, Judy

Associated Organization(s)

Organizational Unit

College of Computing

Organizational Unit

School of Computer Science

Collections

Theses and Dissertations

Permanent Link

http://hdl.handle.net/1853/62745

Abstract

The world around us involves multiple modalities -- we see objects, feel texture, hear sounds, smell odors and so on. In order for Artificial Intelligence (AI) to make progress in understanding the world around us, it needs to be able to interpret and reason about multiple modalities. In this thesis, I take steps towards studying how inducing appropriate grounding in deep models improves multi-modal AI capabilities, in the context of vision and language. Specifically, I cover four tasks: visual question answering, neural image captioning, visual dialog, and vision and language pretraining. In visual question answering, we collected a large-scale visual question answering dataset, and I study various baselines to benchmark these tasks. To jointly reason about image and question, I propose a novel co-attention mechanism that can learn fine-grained grounding to answer the question. In image captioning, I address model design for grounded caption generation for an image. A key focus is to extend the model with the ability to know when to look at the image when generating each word. For words that have explicit visual correspondence, we further propose a novel approach that reconciles classical slot-filling approaches with modern neural captioning approaches. As a result, our model can produce natural language explicitly grounded in entities that object detectors find in the image. In visual dialog, I study both sides of the visual dialog agents -- questioner and answerer. To model the answerer, which answers visual questions in dialog, I introduce a novel discriminant perceptual loss that transfers knowledge from a discriminative model to a generative model. To model the questioner, I consider an image guessing game as a test-bed for balancing task performance and language drift.
I propose a Dialog without Dialog task, which requires agents to generalize from single-round visual question generation with full supervision to a multi-round dialog-based image guessing game without direct language supervision. The proposed visually grounded dialog models can adapt to new tasks while exhibiting less linguistic drift. In vision and language pretraining, I study more general models that can learn visual groundings from massive metadata on the internet. I also explore multi-task vision and language representation learning. Our results show not only that a single model can perform all 12 vision and language tasks, but also that joint training can lead to improvements in task metrics compared to single-task training with the same architecture. Through this work, I demonstrate that inducing appropriate grounding in deep models improves multi-modal AI capabilities. Finally, I briefly discuss the challenges in this domain and the extensions of recent works.

Date Issued

2020-01-13

Resource Type

Text

Resource Subtype

Dissertation


Georgia Tech Library
