Recent advances in egocentric video understanding models are promising, but their heavy computational expense is a barrier for many real-world applications. To address this challenge, we propose EgoDistill, a distillation-based approach that learns to reconstruct heavy egocentric video clip features by combining the semantics from a sparse set of video frames with head motion from lightweight IMU readings. We further devise a novel self-supervised training strategy for IMU feature learning. Our method leads to significant improvements in efficiency, requiring 200× fewer GFLOPs than equivalent video models. We demonstrate its effectiveness on the Ego4D and EPIC-Kitchens datasets, where our method outperforms state-of-the-art efficient video understanding methods.
We show the EgoDistill architecture in the figure above. Left: self-supervised IMU feature learning. Given the start and end frames of a clip, we train the IMU encoder to anticipate the visual changes between them. Right: video feature distillation with IMU. Given one or more image frames and the IMU signal, along with our pre-trained IMU encoder, our method trains a lightweight model with knowledge distillation to reconstruct the features of a heavier video model. When the input includes more than one image frame, the image encoder aggregates frame features temporally with a GRU.
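To make the distillation setup concrete, the sketch below shows the overall data flow under simplifying assumptions: linear "encoders" stand in for the paper's image and IMU networks, mean pooling stands in for the GRU, and the distillation objective is feature regression (MSE against the heavy video model's clip feature). All layer sizes, weight matrices (`W`, `Wf`), and function names here are illustrative, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_frames(frames, W):
    # Per-frame linear "encoder" (stand-in for an image CNN), followed by
    # mean pooling as a stand-in for the paper's GRU temporal aggregation.
    feats = frames @ W                      # (T, D) per-frame features
    return feats.mean(axis=0)               # (D,) aggregated clip feature

def fuse(frame_feat, imu_feat, Wf):
    # Concatenate image and IMU features, project to the teacher's
    # feature dimension to form the lightweight student's prediction.
    return np.concatenate([frame_feat, imu_feat]) @ Wf

def distill_loss(student_feat, teacher_feat):
    # Knowledge distillation as feature reconstruction: MSE between the
    # student's fused feature and the heavy video model's clip feature.
    return float(np.mean((student_feat - teacher_feat) ** 2))

# Toy sizes: T frames, P pixels per frame, D image-feature dim,
# DI IMU-feature dim, DT teacher-feature dim.
T, P, D, DI, DT = 4, 32, 16, 8, 16
frames = rng.normal(size=(T, P))            # sparse sampled frames
imu_feat = rng.normal(size=(DI,))           # output of pre-trained IMU encoder
W = rng.normal(size=(P, D)) * 0.1
Wf = rng.normal(size=(D + DI, DT)) * 0.1
teacher_feat = rng.normal(size=(DT,))       # heavy video model's clip feature

student_feat = fuse(encode_frames(frames, W), imu_feat, Wf)
loss = distill_loss(student_feat, teacher_feat)
```

In training, minimizing `loss` over clips drives the lightweight student (which only ever sees a few frames plus IMU) to mimic the heavy video model's features, which is where the efficiency gain comes from at inference time.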