Summary

Video data encompasses a tremendous amount of both visual and semantic knowledge. Traditional approaches to video activity understanding are based on training machine learning models or, more recently, a variety of deep learning architectures to capture the underlying semantics of the video using human-annotated training data. However, this restricts the trained models to the ontology given by the annotations. A deeper understanding of video activities extends beyond the recognition of underlying concepts such as actions and objects: constructing deep semantic representations requires reasoning about the semantic relationships among these concepts, often beyond what is directly observed in the data. We propose an energy minimization framework that leverages large-scale commonsense knowledge bases, such as ConceptNet, to provide contextual cues that establish semantic relationships among entities hypothesized directly from the video signal. We express this mathematically using the language of Grenander's canonical pattern generator theory. We show that encoding commonsense knowledge as a prior alleviates the need for large annotated training datasets and helps tackle imbalance in the training data. Through extensive experiments, we show that commonsense knowledge from ConceptNet allows the proposed approach to handle challenges such as training data imbalance, weak features, and complex semantic relationships and visual scenes. We also find that the use of commonsense knowledge yields highly interpretable models that can be used in a dialog system for better human-machine interaction.
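As a rough illustration of the idea (not the paper's actual formulation), the sketch below scores a candidate interpretation by combining a data term (confidence of concepts hypothesized from the video) with a commonsense prior (pairwise compatibility of those concepts), and searches for a low-energy configuration. The relatedness table, detector confidences, and simulated-annealing search are hypothetical placeholders standing in for ConceptNet scores, real detectors, and the inference procedure described above.

```python
import math
import random

# Hypothetical stand-ins for ConceptNet relatedness between concepts;
# higher values mean the two concepts plausibly co-occur in an activity.
RELATEDNESS = {
    frozenset(["person", "play_guitar"]): 0.90,
    frozenset(["guitar", "play_guitar"]): 0.95,
    frozenset(["laptop", "watch_laptop"]): 0.90,
    frozenset(["guitar", "watch_laptop"]): 0.10,
}

def relatedness(a, b):
    # Default to a small value for unrelated concept pairs.
    return RELATEDNESS.get(frozenset([a, b]), 0.05)

def energy(interpretation, detections):
    """Energy of a configuration: low energy means the chosen concepts both
    agree with the detector confidences and are mutually compatible under
    the commonsense prior."""
    data_term = -sum(math.log(detections[c] + 1e-6) for c in interpretation)
    prior_term = -sum(
        math.log(relatedness(a, b) + 1e-6)
        for i, a in enumerate(interpretation)
        for b in interpretation[i + 1:]
    )
    return data_term + prior_term

def anneal(candidates, detections, steps=2000, t0=1.0):
    """Toy simulated-annealing search over one label choice per slot."""
    current = [random.choice(slot) for slot in candidates]
    best, best_e = current[:], energy(current, detections)
    for step in range(steps):
        t = t0 * (1 - step / steps) + 1e-3
        proposal = current[:]
        i = random.randrange(len(candidates))
        proposal[i] = random.choice(candidates[i])
        delta = energy(proposal, detections) - energy(current, detections)
        if delta < 0 or random.random() < math.exp(-delta / t):
            current = proposal
            e = energy(current, detections)
            if e < best_e:
                best, best_e = current[:], e
    return best, best_e

# Example usage with made-up detector confidences: one slot per entity
# (actor, object, action), each with competing candidate labels.
detections = {"person": 0.9, "guitar": 0.7, "laptop": 0.3,
              "play_guitar": 0.6, "watch_laptop": 0.4}
candidates = [["person"], ["guitar", "laptop"], ["play_guitar", "watch_laptop"]]
print(anneal(candidates, detections))
```

In this toy setting the commonsense prior pulls the search toward the mutually compatible configuration (person, guitar, play_guitar) even when individual detector scores are weak or ambiguous, which is the intuition behind using ConceptNet as contextual evidence.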

People

Example

[Example figure: video frames shown with their groundtruth labels and the corresponding activity interpretations produced by the model, e.g., groundtruth "Watch Laptop" interpreted as "Watch Laptop", and groundtruth "A person is playing a guitar." interpreted as "A person is playing a guitar."]

Papers Published