Attention and Feature Representation in Spatiotemporal Networks

How can we quantify the static and dynamic biases of action recognition models?

Models learn features in order to classify videos, but their internal decision-making process is often opaque. Because videos have both spatial and temporal dimensions, they exhibit two kinds of features, and both models and datasets can be biased toward one kind or the other.

Term              Definition
Static Features   Derived from single frames, e.g., texture, color, shape.
Dynamic Features  Derived from multiple frames, e.g., motion.
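
To make the distinction concrete, a crude static feature can be computed from one frame alone, while even the simplest dynamic feature needs at least two. The NumPy sketch below uses hypothetical helpers (mean frame color, mean frame-to-frame difference) as illustrative stand-ins, not the features the networks actually learn:

```python
import numpy as np

def static_feature(video: np.ndarray) -> np.ndarray:
    """Crude static feature: per-channel mean color of one frame.

    `video` is assumed to have shape (T, H, W, C); any single frame
    suffices, since static features ignore time.
    """
    middle_frame = video[len(video) // 2]
    return middle_frame.mean(axis=(0, 1))

def dynamic_feature(video: np.ndarray) -> float:
    """Crude dynamic feature: average frame-to-frame change.

    Requires multiple frames; a single frame carries no motion signal.
    """
    diffs = np.abs(np.diff(video.astype(np.float32), axis=0))
    return float(diffs.mean())
```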
Synthetic Dataset

Real-world videos do not allow control over exact static or dynamic content: they often contain multiple moving parts, each with its own complex composition of motions. Hence, we generated a synthetic dataset that lets us specify the motion and attributes of every object. Each video contains a single ‘target’ object tracing a predefined motion pattern through time, plus ‘distractor’ objects in the background. We can therefore use the target’s trajectory as a reference for the dynamic features we expect the model to pick up on.
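
The sketch below shows how such a video might be rendered, assuming square objects on a blank grayscale canvas; the function names, object sizes, and the circular motion class are illustrative placeholders rather than our exact generation pipeline:

```python
import numpy as np

def render_video(trajectory, num_frames=32, size=64, num_distractors=3, rng=None):
    """Render a synthetic video with one target tracing `trajectory`.

    `trajectory(t)` maps a time in [0, 1] to (row, col) coordinates.
    Distractors are static squares at random positions.
    """
    rng = rng or np.random.default_rng(0)
    video = np.zeros((num_frames, size, size), dtype=np.float32)

    # Distractor positions are sampled once, so they stay fixed over time.
    distractors = rng.integers(4, size - 4, size=(num_distractors, 2))

    for t in range(num_frames):
        frame = video[t]
        for r, c in distractors:
            frame[r - 2:r + 2, c - 2:c + 2] = 0.5
        # The target object follows the predefined motion pattern.
        r, c = trajectory(t / (num_frames - 1))
        r, c = int(np.clip(r, 2, size - 3)), int(np.clip(c, 2, size - 3))
        frame[r - 2:r + 2, c - 2:c + 2] = 1.0
    return video

# Example motion class: the target traces a circle.
def circle(t, size=64, radius=20):
    angle = 2 * np.pi * t
    return (size / 2 + radius * np.sin(angle),
            size / 2 + radius * np.cos(angle))

video = render_video(circle)  # shape (32, 64, 64)
```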

Model Visualizations

To visualize the decision-making process of spatiotemporal models, we use Grad-CAM and two of its variants, Grad-CAM++ and EigenCAM. Grad-CAM generates coarse localization maps that highlight the regions of an input which influence a model’s decision. By computing the gradients of the model’s output with respect to the activations of an internal layer, we can construct 3D “explanations” of the model’s prediction for any video we pass in.
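
Below is a minimal sketch of vanilla Grad-CAM extended to a video model, assuming a PyTorch 3D CNN; the model and layer handles are placeholders. (Grad-CAM++ reweights the gradients differently, and EigenCAM instead uses the principal components of the activations.)

```python
import torch
import torch.nn.functional as F

def grad_cam_3d(model, video, target_class, layer):
    """Grad-CAM for a video model, producing a 3D localization map.

    `video` has shape (1, C, T, H, W); `layer` is the internal module
    whose activations we explain.
    """
    activations, gradients = {}, {}

    def fwd_hook(_, __, out):
        activations["a"] = out

    def bwd_hook(_, __, grad_out):
        gradients["g"] = grad_out[0]

    h1 = layer.register_forward_hook(fwd_hook)
    h2 = layer.register_full_backward_hook(bwd_hook)

    model.zero_grad()
    score = model(video)[0, target_class]
    score.backward()
    h1.remove(); h2.remove()

    # Weight each channel by its spatially and temporally averaged gradient.
    a = activations["a"][0]            # (K, T', H', W')
    g = gradients["g"][0]              # (K, T', H', W')
    weights = g.mean(dim=(1, 2, 3))    # one weight per channel
    cam = F.relu((weights[:, None, None, None] * a).sum(dim=0))
    return cam / (cam.max() + 1e-8)    # normalized 3D "explanation"
```

Averaging the gradients over time as well as space gives one weight per channel, which is the natural extension of 2D Grad-CAM to a spatiotemporal volume.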

Current Results

As demonstrated by the 3D volumes above, the prediction’s localization map does not fully match the target object’s true trajectory. The preliminary results also reveal that the model does not need every frame to classify a motion with high accuracy, which suggests that the defining dynamic features of our synthetic motions may not extend across the entire trajectory.
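
One way to probe this is to re-evaluate the classifier while dropping frames. The harness below is a hypothetical sketch, assuming a PyTorch model that accepts variable-length clips of shape (N, C, T, H, W); it is not our exact protocol:

```python
import torch

@torch.no_grad()
def accuracy_with_frame_stride(model, videos, labels, stride):
    """Classify videos using only every `stride`-th frame.

    Dropping frames probes whether the defining dynamic features
    span the whole trajectory or only part of it.
    """
    subsampled = videos[:, :, ::stride]      # keep every stride-th frame
    preds = model(subsampled).argmax(dim=1)
    return (preds == labels).float().mean().item()

# e.g., compare accuracy with all frames (stride 1) against sparser inputs:
# for s in (1, 2, 4, 8):
#     print(s, accuracy_with_frame_stride(model, videos, labels, s))
```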
