Title: Visually grounded language understanding and generation

dc.contributor.advisor Parikh, Devi
dc.contributor.advisor Batra, Dhruv
dc.contributor.advisor Corso, Jason J.
dc.contributor.advisor Riedl, Mark O.
dc.contributor.advisor Hoffman, Judy
dc.contributor.author Lu, Jiasen
dc.contributor.department Computer Science
dc.date.accessioned 2020-05-20T16:59:01Z
dc.date.available 2020-05-20T16:59:01Z
dc.date.created 2020-05
dc.date.issued 2020-01-13
dc.date.submitted May 2020
dc.date.updated 2020-05-20T16:59:01Z
dc.description.abstract The world around us involves multiple modalities -- we see objects, feel textures, hear sounds, smell odors, and so on. In order for Artificial Intelligence (AI) to make progress in understanding the world around us, it needs to be able to interpret and reason about multiple modalities. In this thesis, I take steps towards studying how inducing appropriate grounding in deep models improves multi-modal AI capabilities, in the context of vision and language. Specifically, I cover four tasks: visual question answering, neural image captioning, visual dialog, and vision-and-language pretraining. In visual question answering, we collected a large-scale visual question answering dataset, and I studied various baselines to benchmark the task. To jointly reason about the image and the question, I propose a novel co-attention mechanism that learns fine-grained grounding for answering the question. In image captioning, I address model designs for generating captions grounded in the image. A key focus is to extend the model with the ability to know when to look at the image while generating each word. For words that have explicit visual correspondence, we further propose a novel approach that reconciles classical slot-filling approaches with modern neural captioning approaches. As a result, our model can produce natural language explicitly grounded in entities that object detectors find in the image. In visual dialog, I study both sides of the visual dialog agents -- the questioner and the answerer. For modeling the answerer, which answers visual questions in a dialog, I introduce a novel discriminative perceptual loss that transfers knowledge from a discriminative model to a generative model. For modeling the questioner, I consider an image-guessing game as a test bed for balancing task performance against language drift. I propose the Dialog without Dialog task, which requires agents to generalize from single-round visual question generation with full supervision to a multi-round, dialog-based image-guessing game without direct language supervision. The proposed visually grounded dialog models can adapt to new tasks while exhibiting less linguistic drift. In vision-and-language pretraining, I study more general models that can learn visual grounding from massive meta-data on the internet. I also explore multi-task vision-and-language representation learning. Our results show not only that a single model can perform all 12 vision-and-language tasks, but also that joint training can lead to improvements in task metrics compared to single-task training with the same architecture. Through this work, I demonstrate that inducing appropriate grounding in deep models improves multi-modal AI capabilities. Finally, I briefly discuss the challenges in this domain and extensions of recent work.
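As a concrete illustration of the co-attention idea mentioned in the abstract, below is a minimal NumPy sketch of one parallel co-attention step between question words and image regions. It is a sketch under stated assumptions, not the exact formulation from the dissertation: the shapes, the tanh bilinear affinity, and the max-pooling readout are all illustrative choices.

    import numpy as np

    def softmax(x, axis=-1):
        # Numerically stable softmax along the given axis.
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def co_attention(Q, V, W):
        # Q: (T, d) question-word features; V: (N, d) image-region features;
        # W: (d, d) learned bilinear weight. Returns one attended summary
        # vector for the question and one for the image, each of shape (d,).
        C = np.tanh(Q @ W @ V.T)       # (T, N) word-region affinity
        a_v = softmax(C.max(axis=0))   # (N,) attention over image regions
        a_q = softmax(C.max(axis=1))   # (T,) attention over question words
        return a_q @ Q, a_v @ V

    # Toy usage with random features (a trained model would learn W).
    rng = np.random.default_rng(0)
    q_hat, v_hat = co_attention(rng.normal(size=(5, 8)),
                                rng.normal(size=(10, 8)),
                                rng.normal(size=(8, 8)))
    print(q_hat.shape, v_hat.shape)    # (8,) (8,)

In a full VQA model the attended summaries q_hat and v_hat would then be fused (e.g., concatenated or summed) and fed to an answer classifier; the hierarchical, multi-level mechanism in the thesis is more elaborate than this single step.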
dc.description.degree Ph.D.
dc.format.mimetype application/pdf
dc.identifier.uri http://hdl.handle.net/1853/62745
dc.language.iso en_US
dc.publisher Georgia Institute of Technology
dc.subject Computer vision
dc.subject Natural language processing
dc.subject Visual question answering
dc.subject Multi-task learning
dc.subject Deep learning
dc.title Visually grounded language understanding and generation
dc.type Text
dc.type.genre Dissertation
dspace.entity.type Publication
local.contributor.advisor Parikh, Devi
local.contributor.advisor Riedl, Mark O.
local.contributor.advisor Hoffman, Judy
local.contributor.advisor Batra, Dhruv
local.contributor.corporatename College of Computing
local.contributor.corporatename School of Computer Science
relation.isAdvisorOfPublication 2b8bc15b-448f-472b-8992-ca9862368cad
relation.isAdvisorOfPublication 6512b353-3315-4dd1-9f47-7aaef3e19300
relation.isAdvisorOfPublication 403cff3c-8f25-4db5-978b-ef617a9f8b6a
relation.isAdvisorOfPublication bbee09e1-a4fa-4d99-9dfd-b0605fea0f11
relation.isOrgUnitOfPublication c8892b3c-8db6-4b7b-a33a-1b67f7db2021
relation.isOrgUnitOfPublication 6b42174a-e0e1-40e3-a581-47bed0470a1e
thesis.degree.level Doctoral
Files
Original bundle
Name: LU-DISSERTATION-2020.pdf
Size: 43.98 MB
Format: Adobe Portable Document Format
License bundle
Name: LICENSE.txt
Size: 3.86 KB
Format: Plain Text