(Georgia Institute of Technology, 2022-05)
Chaganti, Sidhartha
Multimodal learning enables networks to consider multiple perspectives, or modalities, of a scene when performing activity recognition. We propose networks that use attention to focus on salient portions of the data along both the embedding and time dimensions. Additionally, we propose a projection network that maps one modality into another, allowing a multimodal network to be used even when the second modality is unavailable at inference time. We observe that adding attention improves performance and that using projected data retains most of the performance of the multimodal architectures.
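As a minimal sketch of the projection idea described above (all names, dimensions, and the linear-layer form are illustrative assumptions, not the thesis's actual architecture): a learned projection maps embeddings from an available modality (e.g. RGB) into the embedding space of a missing modality (e.g. depth), so a multimodal classifier expecting both inputs can still run at inference time.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical embedding sizes: 64-d RGB, 32-d depth.
D_RGB, D_DEPTH, N_CLASSES = 64, 32, 10

# Projection network (a single linear layer here for brevity);
# in practice it would be trained to mimic real depth embeddings.
W_proj = rng.normal(scale=0.1, size=(D_RGB, D_DEPTH))
b_proj = np.zeros(D_DEPTH)

def project_rgb_to_depth(rgb_emb: np.ndarray) -> np.ndarray:
    """Map an RGB-space embedding into the depth embedding space."""
    return np.tanh(rgb_emb @ W_proj + b_proj)

def multimodal_logits(rgb_emb: np.ndarray, depth_emb: np.ndarray,
                      W_cls: np.ndarray) -> np.ndarray:
    """Multimodal head: classify from the concatenated embeddings."""
    fused = np.concatenate([rgb_emb, depth_emb], axis=-1)
    return fused @ W_cls

# At inference, when the depth modality is unavailable,
# substitute the projected embedding for the real one.
rgb = rng.normal(size=(D_RGB,))
depth_hat = project_rgb_to_depth(rgb)
W_cls = rng.normal(scale=0.1, size=(D_RGB + D_DEPTH, N_CLASSES))
logits = multimodal_logits(rgb, depth_hat, W_cls)
print(logits.shape)  # (10,)
```

The multimodal head is unchanged between training and inference; only its second input is swapped from a real depth embedding to a projected one, which is why the projected data can retain most of the multimodal performance.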