A nested motion descriptor is a spatiotemporal representation of motion that is invariant to global camera translation, without requiring an explicit estimate of optical flow or camera stabilization. This descriptor is a natural spatiotemporal extension of the nested shape descriptor to the representation of motion. We demonstrate that the quadrature steerable pyramid can be used to pool phase, and that pooling phase rather than magnitude provides an estimate of camera motion. This motion can be removed using the log-spiral normalization as introduced in the nested shape descriptor. Furthermore, this structure enables an elegant visualization of salient motion using the reconstruction properties of the steerable pyramid. We compare our descriptor to local motion descriptors, HOG-3D and HOGHOF, and show improvements on three activity recognition datasets.
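The link between pooled phase and global translation can be illustrated with a one-dimensional toy example. The sketch below is not the paper's implementation: it swaps the quadrature steerable pyramid for a single one-sided Gaussian band-pass filter, and the names `analytic_bandpass` and `pooled_phase_shift`, the test signal, and all parameter values are assumptions made for illustration. It relies only on the Fourier shift theorem: translating a signal by d samples rotates the phase of a band-pass response centered at frequency ω0 by about −ω0·d, so pooling phase differences between two frames gives an estimate of the translation.

```python
import numpy as np

def analytic_bandpass(signal, k0, sigma_k):
    """Filter with a one-sided Gaussian band centered on frequency bin k0.

    Keeping only positive frequencies makes the output complex: its real
    and imaginary parts form an even/odd (quadrature) pair.
    """
    n = signal.size
    k = np.arange(n)
    transfer = np.exp(-0.5 * ((k - k0) / sigma_k) ** 2)
    return np.fft.ifft(np.fft.fft(signal) * transfer)

def pooled_phase_shift(frame_a, frame_b, k0, sigma_k=2.0):
    """Estimate a global translation (in samples) from pooled phase differences."""
    ra = analytic_bandpass(frame_a, k0, sigma_k)
    rb = analytic_bandpass(frame_b, k0, sigma_k)
    # Summing rb * conj(ra) pools the per-sample phase differences,
    # weighting each one by the local response magnitude.
    pooled = np.sum(rb * np.conj(ra))
    omega0 = 2.0 * np.pi * k0 / frame_a.size
    # A shift by d samples rotates the phase by -omega0 * d.
    return -np.angle(pooled) / omega0

# Toy "frames": frame_b is frame_a translated by 3 samples (circularly).
n = 256
t = np.arange(n)
frame_a = np.cos(2 * np.pi * 16 * t / n) + 0.5 * np.cos(2 * np.pi * 48 * t / n)
frame_b = np.roll(frame_a, 3)
estimate = pooled_phase_shift(frame_a, frame_b, k0=16)
```

Because phase wraps at ±π, a single band like this can only recover shifts smaller than π/ω0 samples (8 samples for the values above).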


This video shows perspective views of the spatiotemporal pooling regions of the nested motion descriptor. The temporal pooling regions grow in proportion to spatial scale, and the spatial pooling regions are the same as those of the nested shape descriptor. The slope of the line connecting the region centers is determined by the velocity tuning of the descriptor.
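The two geometric relationships stated in this caption can be made concrete with a small sketch. Everything numeric here is hypothetical and not taken from the paper: the function name `pooling_regions`, the doubling of radii per scale, the ratio of temporal extent to spatial radius, and the placement of each center at the middle of its temporal window are all assumptions. Only the two properties being illustrated come from the text: temporal extent proportional to spatial scale, and centers lying on a line whose slope is set by the velocity tuning.

```python
def pooling_regions(num_scales, base_radius=1.0, temporal_ratio=0.5, velocity=2.0):
    """Return (spatial_radius, temporal_extent, center_x, center_t) per scale.

    All defaults are hypothetical values chosen for illustration.
    """
    regions = []
    for k in range(num_scales):
        radius = base_radius * 2.0 ** k      # assumed: radius doubles per scale
        extent = temporal_ratio * radius     # temporal extent proportional to spatial scale
        center_t = extent / 2.0              # assumed: center at the middle of the window
        center_x = velocity * center_t       # centers lie on the line x = velocity * t
        regions.append((radius, extent, center_x, center_t))
    return regions

regions = pooling_regions(4)
for radius, extent, center_x, center_t in regions:
    print(f"radius={radius:5.1f}  extent={extent:5.2f}  center=({center_x:5.2f}, {center_t:5.2f})")
```

Changing `velocity` tilts the line through the centers without changing the size of any region.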


The nested motion descriptor represents salient motion in video. These videos show a semitransparent motion saliency map overlaid on each frame. The map shows salient responses in red and non-salient responses in blue. (top left) Salient motion for HMDB “basketball dribbling” using NMD with log-spiral normalization. (top right) NMD without log-spiral normalization includes motion of the camera. (bottom left) Motion saliency for HMDB “rock climbing”. (bottom right) NMD without log-spiral normalization. The log-spiral normalization suppresses the significant camera motion in the scene, focusing on the salient motion of the rock climbers.
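An overlay of this kind can be produced with simple alpha blending. The sketch below is a minimal illustration, not the code used to render the videos: the function name `overlay_saliency`, the linear blue-to-red color ramp, the min-max rescaling of the saliency values, and the default opacity are all assumptions.

```python
import numpy as np

def overlay_saliency(frame, saliency, alpha=0.5):
    """Blend a red/blue saliency map over an RGB frame.

    frame:    (H, W, 3) uint8 image.
    saliency: (H, W) float array; rescaled to [0, 1] here.
    alpha:    opacity of the overlay (hypothetical default).
    """
    s = saliency.astype(np.float64)
    span = s.max() - s.min()
    s = (s - s.min()) / span if span > 0 else np.zeros_like(s)
    # Linear ramp: the red channel carries salient responses,
    # the blue channel carries non-salient responses.
    color = np.zeros(frame.shape, dtype=np.float64)
    color[..., 0] = 255.0 * s
    color[..., 2] = 255.0 * (1.0 - s)
    blended = (1.0 - alpha) * frame.astype(np.float64) + alpha * color
    return np.clip(np.rint(blended), 0, 255).astype(np.uint8)

# Tiny example: a uniform gray frame with one fully salient pixel.
frame = np.full((2, 2, 3), 100, dtype=np.uint8)
saliency = np.array([[0.0, 1.0], [0.5, 0.25]])
out = overlay_saliency(frame, saliency)
```

Applying the function to every frame of a clip, with the per-frame saliency taken from the descriptor response, yields the kind of semitransparent visualization described above.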

  1. J. Byrne, “Nested Motion Descriptors”, Computer Vision and Pattern Recognition (CVPR), Boston, MA, 2015.