Title: Visually grounded language understanding and generation

dc.contributor.advisor Parikh, Devi
dc.contributor.advisor Batra, Dhruv
dc.contributor.advisor Corso, Jason J.
dc.contributor.advisor Riedl, Mark O.
dc.contributor.advisor Hoffman, Judy
dc.contributor.author Lu, Jiasen
dc.contributor.department Computer Science
dc.date.accessioned 2020-05-20T16:59:01Z
dc.date.available 2020-05-20T16:59:01Z
dc.date.created 2020-05
dc.date.issued 2020-01-13
dc.date.submitted May 2020
dc.date.updated 2020-05-20T16:59:01Z
dc.description.abstract The world around us involves multiple modalities -- we see objects, feel textures, hear sounds, smell odors, and so on. In order for Artificial Intelligence (AI) to make progress in understanding the world around us, it needs to be able to interpret and reason about multiple modalities. In this thesis, I take steps towards studying how inducing appropriate grounding in deep models improves multi-modal AI capabilities, in the context of vision and language. Specifically, I cover four tasks: visual question answering, neural image captioning, visual dialog, and vision-and-language pretraining. In visual question answering, we collected a large-scale visual question answering dataset, and I studied various baselines to benchmark the task. To jointly reason about the image and the question, I propose a novel co-attention mechanism that learns fine-grained grounding for answering the question. In image captioning, I address model designs for generating captions grounded in the image. A key focus is to extend the model with the ability to know when to look at the image while generating each word. For words that have explicit visual correspondence, we further propose a novel approach that reconciles classical slot-filling approaches with modern neural captioning approaches. As a result, our model can produce natural language explicitly grounded in entities that object detectors find in the image. In visual dialog, I study both sides of the visual dialog agents -- the questioner and the answerer. For modeling the answerer, which answers visual questions in a dialog, I introduce a novel discriminative perceptual loss that transfers knowledge from a discriminative model to a generative model. For modeling the questioner, I consider an image-guessing game as a test bed for balancing task performance against language drift. I propose the Dialog without Dialog task, which requires agents to generalize from single-round visual question generation with full supervision to a multi-round, dialog-based image-guessing game without direct language supervision. The proposed visually grounded dialog models can adapt to new tasks while exhibiting less linguistic drift. In vision-and-language pretraining, I study more general models that can learn visual grounding from massive meta-data on the internet. I also explore multi-task vision-and-language representation learning. Our results show not only that a single model can perform all 12 vision-and-language tasks, but also that joint training can lead to improvements in task metrics compared to single-task training with the same architecture. Through this work, I demonstrate that inducing appropriate grounding in deep models improves multi-modal AI capabilities. Finally, I briefly discuss the challenges in this domain and extensions of recent work.
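As a concrete illustration of the co-attention idea mentioned in the abstract, below is a minimal NumPy sketch of one parallel co-attention step between question words and image regions. It is a sketch under stated assumptions, not the exact formulation from the dissertation: the shapes, the tanh bilinear affinity, and the max-pooling readout are all illustrative choices.

    import numpy as np

    def softmax(x, axis=-1):
        # Numerically stable softmax along the given axis.
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def co_attention(Q, V, W):
        # Q: (T, d) question-word features; V: (N, d) image-region features;
        # W: (d, d) learned bilinear weight. Returns one attended summary
        # vector for the question and one for the image, each of shape (d,).
        C = np.tanh(Q @ W @ V.T)       # (T, N) word-region affinity
        a_v = softmax(C.max(axis=0))   # (N,) attention over image regions
        a_q = softmax(C.max(axis=1))   # (T,) attention over question words
        return a_q @ Q, a_v @ V

    # Toy usage with random features (a trained model would learn W).
    rng = np.random.default_rng(0)
    q_hat, v_hat = co_attention(rng.normal(size=(5, 8)),
                                rng.normal(size=(10, 8)),
                                rng.normal(size=(8, 8)))
    print(q_hat.shape, v_hat.shape)    # (8,) (8,)

In a full VQA model the attended summaries q_hat and v_hat would then be fused (e.g., concatenated or summed) and fed to an answer classifier; the hierarchical, multi-level mechanism in the thesis is more elaborate than this single step.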
dc.description.degree Ph.D.
dc.format.mimetype application/pdf
dc.identifier.uri http://hdl.handle.net/1853/62745
dc.language.iso en_US
dc.publisher Georgia Institute of Technology
dc.subject Computer vision
dc.subject Natural language processing
dc.subject Visual question answering
dc.subject Multi-task learning
dc.subject Deep learning
dc.title Visually grounded language understanding and generation
dc.type Text
dc.type.genre Dissertation
dspace.entity.type Publication
local.contributor.advisor Parikh, Devi
local.contributor.advisor Riedl, Mark O.
local.contributor.advisor Hoffman, Judy
local.contributor.advisor Batra, Dhruv
local.contributor.corporatename College of Computing
local.contributor.corporatename School of Computer Science
relation.isAdvisorOfPublication 2b8bc15b-448f-472b-8992-ca9862368cad
relation.isAdvisorOfPublication 6512b353-3315-4dd1-9f47-7aaef3e19300
relation.isAdvisorOfPublication 403cff3c-8f25-4db5-978b-ef617a9f8b6a
relation.isAdvisorOfPublication bbee09e1-a4fa-4d99-9dfd-b0605fea0f11
relation.isOrgUnitOfPublication c8892b3c-8db6-4b7b-a33a-1b67f7db2021
relation.isOrgUnitOfPublication 6b42174a-e0e1-40e3-a581-47bed0470a1e
thesis.degree.level Doctoral
Files
Original bundle
Name: LU-DISSERTATION-2020.pdf
Size: 43.98 MB
Format: Adobe Portable Document Format
License bundle
Name: LICENSE.txt
Size: 3.86 KB
Format: Plain Text