Organizational Unit:
School of Interactive Computing


Publication Search Results

  • Item
    Encoding 3D contextual information for dynamic scene understanding
    (Georgia Institute of Technology, 2020-04-27) Hickson, Steven D.
    This thesis aims to demonstrate how using 3D cues improves semantic labeling and object classification. Specifically, we consider depth, surface normals, object classification, and pixel-wise semantic labeling in this work. The work outlined in this document aims to validate the following thesis statement: shape, used as additional context, improves segmentation, unsupervised clustering, object classification, and semantic labeling with little computational overhead. The thesis shows that combining shape and object labels improves results while (1) requiring few extra parameters, (2) yielding better results with surface normals than with depth, and (3) improving accuracy on each individual task. We describe various methods for combining shape and object classification, and then discuss our extensions of this work, which focus specifically on surface normal prediction, depth prediction, and semantic labeling.
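The shape-plus-labels idea in this abstract can be pictured as a shared backbone with two lightweight prediction heads trained jointly. The sketch below is a minimal illustration in PyTorch, not code from the thesis; the architecture, class names, and loss weighting are assumptions chosen only to show how a low-parameter surface-normal head can sit alongside a semantic-labeling head.

```python
# Illustrative sketch only (not code from the thesis): a shared backbone with two
# lightweight heads, one for pixel-wise semantic labels and one for surface normals,
# trained jointly so shape acts as extra context at the cost of few added parameters.
# All module and variable names are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointLabelNormalNet(nn.Module):
    def __init__(self, num_classes=13):
        super().__init__()
        # Tiny shared encoder standing in for a real segmentation backbone.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(inplace=True),
        )
        # Two 1x1 heads: the normal head adds only 64*3 + 3 parameters.
        self.label_head = nn.Conv2d(64, num_classes, 1)
        self.normal_head = nn.Conv2d(64, 3, 1)

    def forward(self, rgb):
        feats = self.encoder(rgb)
        logits = self.label_head(feats)                         # per-pixel class scores
        normals = F.normalize(self.normal_head(feats), dim=1)   # unit surface normals
        return logits, normals

def joint_loss(logits, normals, gt_labels, gt_normals, w=0.5):
    # Cross-entropy for labels plus (1 - cosine similarity) for normals.
    ce = F.cross_entropy(logits, gt_labels)
    cos = 1.0 - F.cosine_similarity(normals, gt_normals, dim=1).mean()
    return ce + w * cos

if __name__ == "__main__":
    model = JointLabelNormalNet()
    rgb = torch.randn(2, 3, 64, 64)
    gt_labels = torch.randint(0, 13, (2, 64, 64))
    gt_normals = F.normalize(torch.randn(2, 3, 64, 64), dim=1)
    logits, normals = model(rgb)
    print(joint_loss(logits, normals, gt_labels, gt_normals))
```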
  • Item
    Leveraging mid-level representations for complex activity recognition
    (Georgia Institute of Technology, 2019-01-16) Ahsan, Unaiza
    Dynamic scene understanding requires learning representations of the components of the scene, including objects, environments, actions, and events. Complex activity recognition from images and videos requires annotating large datasets with action labels, which is a tedious and expensive task. Thus, there is a need to design a mid-level or intermediate feature representation that does not require millions of labels, yet is able to generalize to semantic-level recognition of activities in visual data. This thesis makes three contributions in this regard. First, we propose an event concept-based intermediate representation which learns concepts via the Web and uses this representation to identify events even with a single labeled example. To demonstrate the strength of the proposed approaches, we contribute two diverse social event datasets to the community. We then present a use case of event concepts as a mid-level representation that generalizes to sentiment recognition in diverse social event images. Second, we propose to train Generative Adversarial Networks (GANs) on video frames (which requires no labels), use the trained GAN discriminator as an intermediate representation, and fine-tune it on a smaller labeled video activity dataset to recognize actions in videos. This unsupervised pre-training step avoids any manual feature engineering, video frame encoding, or searching for the best video frame sampling technique. Our third contribution is a self-supervised learning approach on videos that exploits both spatial and temporal coherency to learn feature representations on video data without any supervision. We demonstrate the transfer learning capability of this model on smaller labeled datasets. We present a comprehensive experimental analysis of the self-supervised model to provide insights into the unsupervised pre-training paradigm and how it can help with activity recognition on target datasets that the model has never seen during training.
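The second contribution describes a pipeline: pre-train a GAN on unlabeled video frames, keep the discriminator's learned features, and fine-tune a small classifier on a labeled activity dataset. The PyTorch sketch below illustrates that reuse pattern only; the network sizes and names are assumptions, not the author's implementation.

```python
# Illustrative sketch only: reuse a GAN discriminator's convolutional trunk,
# learned without labels, as a feature extractor and fine-tune a small
# classifier head on a labeled activity dataset. Names and sizes are hypothetical.
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(             # trunk trained during GAN pre-training
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1),
        )
        self.real_fake = nn.Linear(64, 1)           # GAN head, discarded after pre-training

    def forward(self, x):
        return self.real_fake(self.features(x).flatten(1))

class ActivityClassifier(nn.Module):
    def __init__(self, pretrained_disc, num_actions=10):
        super().__init__()
        self.backbone = pretrained_disc.features    # unsupervised representation
        self.head = nn.Linear(64, num_actions)      # fine-tuned with few labels

    def forward(self, frames):
        return self.head(self.backbone(frames).flatten(1))

if __name__ == "__main__":
    disc = Discriminator()                          # assume GAN training happened here
    clf = ActivityClassifier(disc)
    logits = clf(torch.randn(4, 3, 64, 64))
    print(logits.shape)                             # (4, 10)
```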
  • Item
    Leveraging contextual cues for dynamic scene understanding
    (Georgia Institute of Technology, 2016-01-08) Bettadapura, Vinay Kumar
    Environments with people are complex, with many activities and events that need to be represented and explained. The goal of scene understanding is either to determine what objects and people are doing in such complex and dynamic environments, or to know the overall happenings, such as the highlights of the scene. The context within which the activities and events unfold provides key insights that cannot be derived by studying the activities and events alone. In this thesis, we show that this rich contextual information can be successfully leveraged, along with the video data, to support dynamic scene understanding. We categorize and study four different types of contextual cues: (1) spatio-temporal context, (2) egocentric context, (3) geographic context, and (4) environmental context, and show that they improve dynamic scene understanding tasks across several different application domains. We start by presenting data-driven techniques to enrich spatio-temporal context by augmenting Bag-of-Words models with temporal, local, and global causality information, and show that this improves activity recognition, anomaly detection, and scene assessment from videos. Next, we leverage the egocentric context derived from sensor data captured from first-person point-of-view devices to perform field-of-view localization in order to understand the user's focus of attention. We demonstrate single- and multi-user field-of-view localization in both indoor and outdoor environments with applications in augmented reality, event understanding, and studying social interactions. Next, we look at how geographic context can be leveraged to make challenging "in-the-wild" object recognition tasks more tractable, using the problem of food recognition in restaurants as a case study. Finally, we study the environmental context obtained from dynamic scenes such as sporting events, which take place in responsive environments such as stadiums and gymnasiums, and show that it can be successfully used to address the challenging task of automatically generating basketball highlights. We perform comprehensive user studies on 25 full-length NCAA games and demonstrate the effectiveness of environmental context in producing highlights that are comparable to the highlights produced by ESPN.
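One simple way to picture "augmenting Bag-of-Words models with temporal causality information" is to append ordering statistics to the standard order-free histogram. The NumPy sketch below is an illustrative stand-in, not the thesis's method; the ordered-pair feature and vocabulary size are assumptions.

```python
# Illustrative sketch only: augment a Bag-of-Words video descriptor with simple
# temporal ordering statistics (counts of codeword a immediately followed by b),
# one way to inject causality cues. Names and sizes are hypothetical.
import numpy as np

def bow_histogram(codewords, vocab_size):
    """Standard order-free Bag-of-Words histogram."""
    hist = np.bincount(codewords, minlength=vocab_size).astype(float)
    return hist / max(hist.sum(), 1.0)

def ordered_pair_histogram(codewords, vocab_size):
    """Counts of ordered codeword pairs (a immediately followed by b)."""
    pairs = np.zeros((vocab_size, vocab_size))
    for t in range(len(codewords) - 1):
        pairs[codewords[t], codewords[t + 1]] += 1.0
    return pairs.ravel() / max(pairs.sum(), 1.0)

def augmented_descriptor(codewords, vocab_size):
    # Concatenate the order-free histogram with the ordering statistics.
    return np.concatenate([bow_histogram(codewords, vocab_size),
                           ordered_pair_histogram(codewords, vocab_size)])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    codewords = rng.integers(0, 8, size=200)         # quantized spatio-temporal features
    print(augmented_descriptor(codewords, 8).shape)  # (8 + 64,)
```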
  • Item
    Automatic eating detection in real-world settings with commodity sensing
    (Georgia Institute of Technology, 2016-01-07) Thomaz, Edison
    Motivated by challenges and opportunities in nutritional epidemiology and food journaling, ubiquitous computing researchers have proposed numerous techniques for automated dietary monitoring (ADM) over the years. Although progress has been made, a truly practical system that can automatically recognize what people eat in real-world settings remains elusive. This dissertation addresses the problem of ADM by focusing on practical eating moment detection. Eating detection is a foundational element of ADM, since automatically recognizing when a person is eating is required before identifying what and how much is being consumed. Additionally, eating detection can serve as the basis for new types of dietary self-monitoring practices such as semi-automated food journaling. In this thesis, I show that everyday eating moments such as breakfast, lunch, and dinner can be automatically detected in real-world settings by opportunistically leveraging sensors in practical, off-the-shelf wearable devices. I refer to this instrumentation approach as "commodity sensing". The work covered by this thesis encompasses a series of experiments I conducted with a total of 106 participants in which I explored a variety of sensing modalities for automatic eating moment detection. The modalities studied include first-person images taken with wearable cameras, ambient sounds, and on-body inertial sensors. I discuss the extent to which first-person images reflecting everyday experiences can be used to identify eating moments using two approaches: human computation, and a combination of state-of-the-art machine learning and computer vision techniques. I also describe privacy challenges that arise with first-person photographs. Next, I present results showing how certain sounds associated with eating can be recognized and used to infer eating activities. Finally, I elaborate on findings from three studies focused on the use of on-body inertial sensors (head and wrists) to recognize eating moments both in a semi-controlled laboratory setting and in real-world conditions. I conclude by relating findings and insights to practical applications and highlighting opportunities for future work.
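Inertial eating-moment detection of the kind described here is commonly framed as sliding-window classification: segment the sensor stream, extract features per window, and classify each window. The sketch below shows only that generic pipeline; the window length, features, classifier, and synthetic data are assumptions, not the dissertation's actual settings.

```python
# Illustrative sketch only: a generic sliding-window pipeline for wrist inertial
# sensing, i.e. segment the accelerometer stream into windows, compute simple
# statistical features, and classify each window as eating / non-eating.
# Window length, features, and labels are hypothetical.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def window_features(accel, win=100, step=50):
    """accel: (T, 3) accelerometer stream -> (num_windows, num_features)."""
    feats = []
    for start in range(0, len(accel) - win + 1, step):
        w = accel[start:start + win]
        feats.append(np.concatenate([w.mean(0), w.std(0),
                                     np.abs(np.diff(w, axis=0)).mean(0)]))
    return np.array(feats)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    accel = rng.normal(size=(2000, 3))               # stand-in for real wrist data
    X = window_features(accel)
    y = rng.integers(0, 2, size=len(X))              # stand-in eating/non-eating labels
    clf = RandomForestClassifier(n_estimators=50).fit(X, y)
    print(clf.predict(X[:5]))                        # per-window eating predictions
```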
  • Item
    Segmental discriminative analysis for American Sign Language recognition and verification
    (Georgia Institute of Technology, 2010-04-06) Yin, Pei
    This dissertation presents segmental discriminative analysis techniques for American Sign Language (ASL) recognition and verification. ASL recognition is a sequence classification problem. One of the most successful techniques for recognizing ASL is the hidden Markov model (HMM) and its variants. This dissertation addresses two problems in sign recognition by HMMs. The first is discriminative feature selection for temporally correlated data. Temporal correlation in sequences often causes difficulties in feature selection. To mitigate this problem, this dissertation proposes segmentally-boosted HMMs (SBHMMs), which construct state-optimized features in a segmental and discriminative manner. The second problem is the decomposition of ASL signs for efficient and accurate recognition. For this problem, this dissertation proposes discriminative state-space clustering (DISC), a data-driven method of automatically extracting sub-sign units by state-tying from the results of feature selection. DISC and SBHMMs can jointly search for discriminative feature sets and representation units for ASL recognition. ASL verification, which determines whether an input signing sequence matches a pre-defined phrase, shares similarities with ASL recognition, but it has more prior knowledge and a higher expectation of accuracy. Therefore, ASL verification requires additional discriminative analysis, not only in utilizing prior knowledge but also in actively selecting a set of phrases that have a high expectation of verification accuracy, in order to improve the user experience. This dissertation describes ASL verification using CopyCat, an ASL game that helps deaf children acquire language abilities at an early age. It then presents the "probe" technique, which uses prior knowledge to automatically search for an optimal verification threshold, and BIG, a bi-gram error-ranking predictor that efficiently selects or creates phrases which, based on the previous performance of existing verification systems, should have high verification accuracy. This work demonstrates the utility of the described technologies in a series of experiments. SBHMMs are validated in ASL phrase recognition as well as in various other applications such as lip reading and speech recognition. DISC-SBHMMs consistently produce fewer errors than traditional HMMs and SBHMMs in recognizing ASL phrases using an instrumented glove. Probe achieves verification efficacy comparable to the optimum obtained from a manual exhaustive search. Finally, when verifying phrases in CopyCat, BIG predicts which CopyCat phrases, even those unseen in training, will have the best verification accuracy, with results comparable to much more computationally intensive methods.
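HMM-based phrase verification can be summarized as scoring an input sequence under the phrase model and accepting it if the log-likelihood clears a threshold chosen from prior data, which is the role the "probe" idea plays here. The NumPy sketch below is a toy illustration under those assumptions, not the dissertation's SBHMM/DISC machinery; the two-state model and its parameters are invented for the example.

```python
# Illustrative sketch only: verify a signing sequence by thresholding its
# log-likelihood under a phrase HMM; the threshold is grid-searched on
# validation scores (a simplified stand-in for the "probe" technique).
import numpy as np

def forward_loglik(obs, start, trans, emit):
    """Scaled forward algorithm: log P(obs | HMM) for a discrete-output HMM."""
    alpha = start * emit[:, obs[0]]
    loglik = np.log(alpha.sum())
    alpha = alpha / alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ trans) * emit[:, o]
        s = alpha.sum()
        loglik += np.log(s)
        alpha = alpha / s
    return loglik

def pick_threshold(pos_scores, neg_scores):
    """Grid-search the accept/reject threshold that best separates the two sets."""
    best_t, best_acc = None, -1.0
    for t in np.linspace(min(neg_scores), max(pos_scores), 50):
        acc = (np.mean(pos_scores >= t) + np.mean(neg_scores < t)) / 2.0
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

if __name__ == "__main__":
    # Toy two-state, left-to-right phrase model with three observation symbols.
    start = np.array([1.0, 0.0])
    trans = np.array([[0.7, 0.3],
                      [0.0, 1.0]])
    emit = np.array([[0.8, 0.1, 0.1],
                     [0.1, 0.1, 0.8]])
    correct = [0, 0, 1, 2, 2]        # resembles the modeled phrase
    incorrect = [2, 2, 0, 1, 0]      # does not
    pos = np.array([forward_loglik(correct, start, trans, emit)])
    neg = np.array([forward_loglik(incorrect, start, trans, emit)])
    thresh = pick_threshold(pos, neg)
    print("accept correct signing:", pos[0] >= thresh)
```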
  • Item
    Collaborative annotation, analysis, and presentation interfaces for digital video
    (Georgia Institute of Technology, 2009-07-06) Diakopoulos, Nicholas A.
    Information quality corresponds to the degree of excellence in communicating knowledge or intelligence and encompasses aspects of validity, accuracy, reliability, bias, transparency, and comprehensiveness among others. Professional news, public relations, and user generated content alike all have their own subtly different information quality concerns. With so much recent growth in online video, it is also apparent that more and more consumers will be getting their information from online videos and that understanding the information quality of video becomes paramount for a consumer wanting to make decisions based on it. This dissertation explores the design and evaluation of collaborative video annotation and presentation interfaces as motivated by the desire for better information quality in online video. We designed, built, and evaluated three systems: (1) Audio Puzzler, a puzzle game which as a by-product of play produces highly accurate time-stamped transcripts of video, (2) Videolyzer, a video annotation system designed to aid bloggers and journalists collect, aggregate, and share analyses of information quality of video, and (3) Videolyzer CE, a simplified video annotation presentation which syndicates the knowledge collected using Videolyzer to a wider range of users in order to modulate their perceptions of video information. We contribute to knowledge of different interface methods for collaborative video annotation and to mechanisms for enhancing accuracy of objective metadata such as transcripts as well as subjective notions of information quality of the video itself.