Attention and Feature Representation in Spatiotemporal Networks

How can we quantify the static and dynamic biases of action recognition models?

Models learn features in order to classify videos, but their internal decision-making process is often opaque. Because videos have both spatial and temporal dimensions, they exhibit two kinds of features, and both models and datasets can be biased toward one kind or the other.

Term              Definition
Static Features   Derived from single frames, e.g., texture, color, shape.
Dynamic Features  Derived from multiple frames, e.g., motion.
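
To make the distinction concrete, a crude static feature can be computed from one frame alone, while even the simplest dynamic feature needs at least two. The NumPy sketch below uses hypothetical helpers (mean frame color, mean frame-to-frame difference) as illustrative stand-ins, not the features the networks actually learn:

```python
import numpy as np

def static_feature(video: np.ndarray) -> np.ndarray:
    """Crude static feature: per-channel mean color of one frame.

    `video` is assumed to have shape (T, H, W, C); any single frame
    suffices, since static features ignore time.
    """
    middle_frame = video[len(video) // 2]
    return middle_frame.mean(axis=(0, 1))

def dynamic_feature(video: np.ndarray) -> float:
    """Crude dynamic feature: average frame-to-frame change.

    Requires multiple frames; a single frame carries no motion signal.
    """
    diffs = np.abs(np.diff(video.astype(np.float32), axis=0))
    return float(diffs.mean())
```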
Synthetic Dataset

Real-world videos do not allow control over exact static or dynamic content: they often contain multiple moving parts, each with its own complex composition of motions. Hence, we generated a synthetic dataset that lets us specify the motion and attributes of every object. Each video contains a single ‘target’ object tracing a predefined motion pattern through time, plus ‘distractor’ objects in the background. We can therefore use the target’s trajectory as a reference for the dynamic features we expect the model to pick up on.
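
The sketch below shows how such a video might be rendered, assuming square objects on a blank grayscale canvas; the function names, object sizes, and the circular motion class are illustrative placeholders rather than our exact generation pipeline:

```python
import numpy as np

def render_video(trajectory, num_frames=32, size=64, num_distractors=3, rng=None):
    """Render a synthetic video with one target tracing `trajectory`.

    `trajectory(t)` maps a time in [0, 1] to (row, col) coordinates.
    Distractors are static squares at random positions.
    """
    rng = rng or np.random.default_rng(0)
    video = np.zeros((num_frames, size, size), dtype=np.float32)

    # Distractor positions are sampled once, so they stay fixed over time.
    distractors = rng.integers(4, size - 4, size=(num_distractors, 2))

    for t in range(num_frames):
        frame = video[t]
        for r, c in distractors:
            frame[r - 2:r + 2, c - 2:c + 2] = 0.5
        # The target object follows the predefined motion pattern.
        r, c = trajectory(t / (num_frames - 1))
        r, c = int(np.clip(r, 2, size - 3)), int(np.clip(c, 2, size - 3))
        frame[r - 2:r + 2, c - 2:c + 2] = 1.0
    return video

# Example motion class: the target traces a circle.
def circle(t, size=64, radius=20):
    angle = 2 * np.pi * t
    return (size / 2 + radius * np.sin(angle),
            size / 2 + radius * np.cos(angle))

video = render_video(circle)  # shape (32, 64, 64)
```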

Model Visualizations

To visualize the decision-making process of spatiotemporal models, we use Grad-CAM and two of its variants, Grad-CAM++ and EigenCAM. Grad-CAM generates coarse localization maps that highlight the regions of an input which influence a model’s decision. By computing the gradients of the model’s output with respect to the activations of an internal layer, we can construct 3D “explanations” of the model’s prediction for any video we pass in.
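
Below is a minimal sketch of vanilla Grad-CAM extended to a video model, assuming a PyTorch 3D CNN; the model and layer handles are placeholders. (Grad-CAM++ reweights the gradients differently, and EigenCAM instead uses the principal components of the activations.)

```python
import torch
import torch.nn.functional as F

def grad_cam_3d(model, video, target_class, layer):
    """Grad-CAM for a video model, producing a 3D localization map.

    `video` has shape (1, C, T, H, W); `layer` is the internal module
    whose activations we explain.
    """
    activations, gradients = {}, {}

    def fwd_hook(_, __, out):
        activations["a"] = out

    def bwd_hook(_, __, grad_out):
        gradients["g"] = grad_out[0]

    h1 = layer.register_forward_hook(fwd_hook)
    h2 = layer.register_full_backward_hook(bwd_hook)

    model.zero_grad()
    score = model(video)[0, target_class]
    score.backward()
    h1.remove(); h2.remove()

    # Weight each channel by its spatially and temporally averaged gradient.
    a = activations["a"][0]            # (K, T', H', W')
    g = gradients["g"][0]              # (K, T', H', W')
    weights = g.mean(dim=(1, 2, 3))    # one weight per channel
    cam = F.relu((weights[:, None, None, None] * a).sum(dim=0))
    return cam / (cam.max() + 1e-8)    # normalized 3D "explanation"
```

Averaging the gradients over time as well as space gives one weight per channel, which is the natural extension of 2D Grad-CAM to a spatiotemporal volume.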

Current Results

As demonstrated by the 3D volumes above, the prediction’s localization map does not fully match the target object’s true trajectory. The preliminary results also reveal that the model does not need every frame to classify a motion with high accuracy, which suggests that the defining dynamic features of our synthetic motions may not extend across the entire trajectory.
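
One way to probe this is to re-evaluate the classifier while dropping frames. The harness below is a hypothetical sketch, assuming a PyTorch model that accepts variable-length clips of shape (N, C, T, H, W); it is not our exact protocol:

```python
import torch

@torch.no_grad()
def accuracy_with_frame_stride(model, videos, labels, stride):
    """Classify videos using only every `stride`-th frame.

    Dropping frames probes whether the defining dynamic features
    span the whole trajectory or only part of it.
    """
    subsampled = videos[:, :, ::stride]      # keep every stride-th frame
    preds = model(subsampled).argmax(dim=1)
    return (preds == labels).float().mean().item()

# e.g., compare accuracy with all frames (stride 1) against sparser inputs:
# for s in (1, 2, 4, 8):
#     print(s, accuracy_with_frame_stride(model, videos, labels, s))
```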
