Person: Essa, Irfan

Publication Search Results

Now showing 1 - 7 of 7

Leveraging Context to Support Automated Food Recognition in Restaurants

2015-01 , Bettadapura, Vinay , Thomaz, Edison , Parnami, Aman , Abowd, Gregory D. , Essa, Irfan

The pervasiveness of mobile cameras has resulted in a dramatic increase in food photos, which are pictures reflecting what people eat. In this paper, we study how taking pictures of what we eat in restaurants can be used for the purpose of automating food journaling. We propose to leverage the context of where the picture was taken, with additional information about the restaurant, available online, coupled with state-of-the-art computer vision techniques to recognize the food being consumed. To this end, we demonstrate image-based recognition of foods eaten in restaurants by training a classifier with images from restaurants' online menu databases. We evaluate the performance of our system in unconstrained, real-world settings with food images taken in 10 restaurants across 5 different types of food (American, Indian, Italian, Mexican and Thai).
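
A minimal sketch of the core idea, with every name invented for illustration (ALL_DISHES, restrict_to_menu and the toy menu are assumptions, not the authors' code): the geolocated restaurant's menu acts as a prior that narrows a general food classifier's label space before the final decision.

```python
import numpy as np

# Hypothetical label set for a general food classifier (invented for illustration).
ALL_DISHES = ["pad_thai", "tikka_masala", "lasagna", "burrito", "cheeseburger", "green_curry"]

def restrict_to_menu(class_probs, menu):
    """Zero out dishes not on the geolocated restaurant's menu, then renormalize.

    class_probs: 1-D array of classifier scores over ALL_DISHES.
    menu: set of dish labels taken from the restaurant's online menu.
    """
    mask = np.array([dish in menu for dish in ALL_DISHES], dtype=float)
    restricted = class_probs * mask
    total = restricted.sum()
    return restricted / total if total > 0 else class_probs  # fall back if no overlap

# Toy example: the visual classifier alone is ambiguous, but location context resolves it.
probs = np.array([0.30, 0.28, 0.10, 0.12, 0.05, 0.15])
thai_menu = {"pad_thai", "green_curry"}
print(ALL_DISHES[int(np.argmax(restrict_to_menu(probs, thai_menu)))])  # -> pad_thai
```

Even a weak visual classifier benefits here, since the menu prior removes most competing labels before the argmax.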


Predicting Daily Activities From Egocentric Images Using Deep Learning

2015 , Castro, Daniel , Hickson, Steven , Bettadapura, Vinay , Thomaz, Edison , Abowd, Gregory D. , Christensen, Henrik I. , Essa, Irfan

We present a method to analyze images taken from a passive egocentric wearable camera along with contextual information, such as time and day of week, to learn and predict the everyday activities of an individual. We collected a dataset of 40,103 egocentric images over a 6-month period with 19 activity classes and demonstrate the benefit of state-of-the-art deep learning techniques for learning and predicting daily activities. Classification is conducted using a Convolutional Neural Network (CNN) with a classification method we introduce called a late fusion ensemble. This late fusion ensemble incorporates relevant contextual information and increases our classification accuracy. Our technique achieves an overall accuracy of 83.07% in predicting a person's activity across the 19 activity classes. We also demonstrate promising results from two additional users by fine-tuning the classifier with one day of training data.
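
A simplified sketch of a late fusion ensemble, under the assumption that CNN softmax outputs are precomputed and that context is reduced to (hour, day of week); the data is synthetic and the fusion weight alpha is a made-up free parameter, not the paper's learned ensemble.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, n_classes = 200, 4

# Stand-ins for precomputed CNN softmax outputs on egocentric images.
cnn_probs = rng.dirichlet(np.ones(n_classes), size=n)
# Contextual features: hour of day and day of week.
context = np.column_stack([rng.integers(0, 24, n), rng.integers(0, 7, n)])
labels = rng.integers(0, n_classes, n)

# Train a separate classifier on context alone...
ctx_clf = LogisticRegression(max_iter=1000).fit(context, labels)
ctx_probs = ctx_clf.predict_proba(context)

# ...then fuse the two probability streams late, here with a weighted average.
alpha = 0.7  # weight on the visual stream (a free parameter in this sketch)
fused = alpha * cnn_probs + (1 - alpha) * ctx_probs
predictions = fused.argmax(axis=1)
```

Fusing at the probability level, rather than concatenating raw features, is what makes the ensemble "late": each stream is modeled independently and only their outputs are combined.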


Recognizing Water-Based Activities in the Home Through Infrastructure-Mediated Sensing

2012-09 , Thomaz, Edison , Bettadapura, Vinay , Reyes, Gabriel , Sandesh, Megha , Schindler, Grant , Plötz, Thomas , Abowd, Gregory D. , Essa, Irfan

Activity recognition in the home has long been recognized as the foundation for many desirable applications in fields such as home automation, sustainability, and healthcare. However, building a practical home activity monitoring system remains a challenge. Striking a balance between cost, privacy, ease of installation, and scalability continues to be an elusive goal. In this paper, we explore infrastructure-mediated sensing combined with a vector space model learning approach as the basis of an activity recognition system for the home. We examine the performance of our single-sensor, water-based system in recognizing eleven high-level activities in the kitchen and bathroom, such as cooking and shaving. Results from two studies show that our system can estimate activities with an overall accuracy of 82.69% for one individual and 70.11% for a group of 23 participants. As far as we know, our work is the first to employ infrastructure-mediated sensing for inferring high-level human activities in a home setting.
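
A toy sketch of the vector space model idea, treating discretized sensor events as "words" and activity instances as "documents"; the event vocabulary and labels below are invented, and the actual extraction of events from water infrastructure data in the paper is more involved.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Each activity instance is a "document" of discretized water-flow events
# (the event vocabulary here is invented for illustration).
train_docs = [
    "short_pulse short_pulse long_draw",    # e.g., handwashing
    "long_draw long_draw steady_flow",      # e.g., cooking prep
    "short_pulse steady_flow short_pulse",  # e.g., shaving
]
train_labels = ["handwashing", "cooking", "shaving"]

vectorizer = TfidfVectorizer()
train_vecs = vectorizer.fit_transform(train_docs)

# Classify a new event sequence by cosine similarity in the tf-idf space.
query = vectorizer.transform(["short_pulse long_draw short_pulse"])
similarities = cosine_similarity(query, train_vecs)
print(train_labels[similarities.argmax()])
```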


Egocentric Field-of-View Localization Using First-Person Point-of-View Devices

2015-01 , Bettadapura, Vinay , Essa, Irfan , Pantofaru, Caroline

We present a technique that uses images, videos and sensor data taken from first-person point-of-view devices to perform egocentric field-of-view (FOV) localization. We define egocentric FOV localization as capturing the visual information from a person's field of view in a given environment and transferring this information onto a reference corpus of images and videos of the same space, hence determining what a person is attending to. Our method matches images and video taken from the first-person perspective with the reference corpus and refines the results using the wearer's head-orientation information obtained from the device's sensors. We demonstrate single- and multi-user egocentric FOV localization in different indoor and outdoor environments, with applications in augmented reality, event understanding and the study of social interactions.
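
A rough sketch of matching refined by head orientation, assuming precomputed global descriptors for the reference corpus and a compass bearing per image; the random descriptors and the hard bearing gate below are simplifications invented for illustration, not the paper's pipeline.

```python
import numpy as np

# Stand-ins: one global descriptor and one capture bearing (degrees) per
# reference image of the space. Both are random placeholders here.
rng = np.random.default_rng(1)
ref_feats = rng.normal(size=(100, 128))
ref_bearings = rng.uniform(0, 360, size=100)

def localize(query_feat, query_bearing, max_bearing_diff=45.0):
    """Rank reference views by visual similarity, keeping only views whose
    capture bearing is consistent with the device's orientation sensor."""
    sims = ref_feats @ query_feat / (
        np.linalg.norm(ref_feats, axis=1) * np.linalg.norm(query_feat))
    # Angular difference on a circle, used to discard inconsistent matches.
    diff = np.abs((ref_bearings - query_bearing + 180) % 360 - 180)
    sims[diff > max_bearing_diff] = -np.inf
    return int(np.argmax(sims))

best_view = localize(rng.normal(size=128), query_bearing=90.0)
```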


Video Based Assessment of OSATS Using Sequential Motion Textures

2014 , Sharma, Yachna , Bettadapura, Vinay , Plötz, Thomas , Hammerla, Nils , Mellor, Sebastian , McNaney, Roisin , Olivier, Patrick , Deshmukh, Sandeep , McCaskie, Andrew , Essa, Irfan

We present a fully automated framework for video-based surgical skill assessment that incorporates the sequential and qualitative aspects of surgical motion in a data-driven manner. We replicate Objective Structured Assessment of Technical Skills (OSATS) assessments, which provide both an overall and an in-detail evaluation of the basic suturing skills required of surgeons. Video analysis techniques are introduced that incorporate sequential motion aspects into motion textures. We also demonstrate significant performance improvements over standard bag-of-words and motion-analysis approaches. We evaluate our framework in a case study that involved medical students with varying levels of expertise performing basic surgical tasks in a surgical training lab setting.
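
A loose sketch of the motion-texture intuition only, not the paper's descriptor: per-frame motion histograms are compared across frame pairs to form a similarity "texture" image, and coarse statistics of that image summarize the sequential structure of the motion.

```python
import numpy as np

def motion_texture_features(frame_motion):
    """Summarize the temporal texture of a motion sequence.

    frame_motion: (n_frames, n_bins) per-frame motion-magnitude histograms.
    Returns mean/std over the first few diagonals of the frame-similarity
    matrix, i.e., how self-similar the motion is at short time lags.
    """
    hist = frame_motion / (frame_motion.sum(axis=1, keepdims=True) + 1e-9)
    sim = hist @ hist.T  # frame-to-frame similarity "texture" image
    offsets = range(1, min(10, sim.shape[0]))
    diagonals = [np.diagonal(sim, k) for k in offsets]
    return np.array([f(d) for d in diagonals for f in (np.mean, np.std)])

# Toy input: 50 frames of 8-bin motion histograms.
rng = np.random.default_rng(2)
features = motion_texture_features(rng.random((50, 8)))
```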


Automated Assessment of Surgical Skills Using Frequency Analysis

2015 , Zia, Aneeq , Sharma, Yachna , Bettadapura, Vinay , Sarin, Eric L. , Clements, Mark A. , Essa, Irfan

We present an automated framework for visual assessment of the expertise level of surgeons using the OSATS (Objective Structured Assessment of Technical Skills) criteria. Video analysis techniques for extracting motion quality via frequency coefficients are introduced. The framework is tested on videos of medical students with different expertise levels performing basic surgical tasks in a surgical training lab setting. We demonstrate that transforming the sequential time data into frequency components effectively extracts the information that differentiates between the surgeons' skill levels. The results show significant performance improvements using DFT and DCT coefficients over known state-of-the-art techniques.
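
A small sketch of the frequency-analysis step, assuming a 1-D motion time-series has already been extracted from the video; SciPy's DCT stands in for the paper's DFT/DCT feature computation, and the smooth-versus-jerky signals are synthetic.

```python
import numpy as np
from scipy.fft import dct

def frequency_features(motion_series, k=16):
    """Keep the first k DCT coefficients of a motion time-series. Low-frequency
    energy reflects smooth, controlled motion; high-frequency energy reflects
    jerkier movement, which is the cue for separating skill levels."""
    return dct(motion_series, norm="ortho")[:k]

# Synthetic series: smooth "expert" motion vs. the same motion plus jitter.
t = np.linspace(0, 1, 256)
expert = np.sin(2 * np.pi * 2 * t)
novice = expert + 0.5 * np.random.default_rng(3).normal(size=t.size)
print(np.abs(frequency_features(expert)).round(2))
print(np.abs(frequency_features(novice)).round(2))
```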


Augmenting Bag-of-Words: Data-Driven Discovery of Temporal and Structural Information for Activity Recognition

2013-06 , Bettadapura, Vinay , Schindler, Grant , Plötz, Thomas , Essa, Irfan

We present data-driven techniques to augment Bag of Words (BoW) models, which allow for more robust modeling and recognition of complex long-term activities, especially when the structure and topology of the activities are not known a priori. Our approach specifically addresses the limitations of standard BoW approaches, which fail to represent the underlying temporal and causal information that is inherent in activity streams. In addition, we also propose the use of randomly sampled regular expressions to discover and encode patterns in activities. We demonstrate the effectiveness of our approach in experimental evaluations where we successfully recognize activities and detect anomalies in four complex datasets.
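
A minimal sketch of one way temporal order can augment BoW, using n-grams over consecutive events; the paper additionally discovers patterns with randomly sampled regular expressions, which this toy example does not implement.

```python
from collections import Counter

def augmented_bow(events, n=2):
    """Standard BoW counts single events only; adding n-grams of consecutive
    events re-introduces the temporal ordering that plain BoW discards."""
    hist = Counter(events)  # plain bag of words (unigrams)
    hist.update(tuple(events[i:i + n]) for i in range(len(events) - n + 1))
    return hist

# Two activities with identical event counts but different temporal structure:
seq_a = ["open", "pour", "stir", "close"]
seq_b = ["open", "stir", "pour", "close"]
print(augmented_bow(seq_a) == augmented_bow(seq_b))  # False: the bigrams differ
```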