Punching Bag vs. Punching Person: Motion Transferability in Videos

(1) Center for Research in Computer Vision, University of Central Florida
(2) Center for Vision Technology, SRI International
ICCV '25

Dataset Overview

[Figure: Dataset overview]

Abstract

Action recognition models demonstrate strong generalization, but can they effectively transfer high-level motion concepts across diverse contexts, even within similar distributions? For example, can a model recognize the broad action "punching" when presented with an unseen variation such as "punching person"? To explore this, we introduce a motion transferability framework with three datasets: (1) Syn-TA, a synthetic dataset with 3D object motions; (2) Kinetics400-TA; and (3) Something-Something-v2-TA, both adapted from natural video datasets. We evaluate 13 state-of-the-art models on these benchmarks and observe a significant drop in performance when recognizing high-level actions in novel contexts. Our analysis reveals: 1) Multimodal models struggle more with fine-grained unknown actions than with coarse ones; 2) The bias-free Syn-TA proves as challenging as real-world datasets, with models showing greater performance drops in controlled settings; 3) Larger models improve transferability when spatial cues dominate but struggle with intensive temporal reasoning, while reliance on object and background cues hinders generalization. We further explore how disentangling coarse and fine motions can improve recognition in temporally challenging datasets. We believe this study establishes a crucial benchmark for assessing motion transferability in action recognition.
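
The question posed above (does a model still recognize "punching" when shown an unseen variation such as "punching person"?) can be made concrete with a small sketch. It assumes, for illustration only, that each fine class maps to exactly one coarse class and that a model outputs one coarse label per video. The mapping, the function name coarse_accuracy_on_unseen, and every class name other than the "punching" pair are hypothetical and are not taken from the benchmark's released code.

```python
from typing import Dict, List, Set

# Hypothetical fine-to-coarse label map. Only the "punching" pair comes from
# the text; the "kicking" entries are made up for illustration.
FINE_TO_COARSE: Dict[str, str] = {
    "punching bag": "punching",
    "punching person": "punching",
    "kicking ball": "kicking",
    "kicking person": "kicking",
}


def coarse_accuracy_on_unseen(
    pred_coarse: List[str], true_fine: List[str], seen_fine: Set[str]
) -> float:
    """Coarse-label accuracy restricted to videos whose fine class was not seen in training."""
    unseen = [(p, f) for p, f in zip(pred_coarse, true_fine) if f not in seen_fine]
    if not unseen:
        raise ValueError("no test videos from unseen fine classes")
    correct = sum(p == FINE_TO_COARSE[f] for p, f in unseen)
    return correct / len(unseen)


# Toy example: the model saw "punching bag" and "kicking ball" during training.
seen = {"punching bag", "kicking ball"}
true_fine = ["punching person", "kicking person", "punching bag", "punching person"]
preds = ["punching", "punching", "punching", "punching"]

acc = coarse_accuracy_on_unseen(preds, true_fine, seen)
print(f"coarse accuracy on unseen fine classes: {acc:.2f}")
```

The third video is ignored because its fine class was seen during training, so the score reflects only how well the coarse concept transfers to new contexts.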

Dataset Details

Dataset             Syn-TA         K400-TA          SSv2-TA
                    Set 1/Set 2    Set 1/Set 2      Set 1/Set 2
# Coarse classes    20             41               26
# Fine classes      53/47          111/94           81/68
# Train videos      3180/2820      73312/56291      78229/63611
# Test videos       2120/1880      5616/4664        11134/9241
# Total videos      10000          139883           162215
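
As a quick sanity check on the table, the per-set train and test counts should add up to the reported totals. The sketch below copies the numbers from the table; the dictionary layout and the helper names are illustrative.

```python
# Counts copied from the table above: (Set 1, Set 2) for train and test.
DATASETS = {
    "Syn-TA": {"train": (3180, 2820), "test": (2120, 1880), "total": 10000},
    "K400-TA": {"train": (73312, 56291), "test": (5616, 4664), "total": 139883},
    "SSv2-TA": {"train": (78229, 63611), "test": (11134, 9241), "total": 162215},
}


def summed_total(counts):
    """Add train and test videos across both sets."""
    return sum(counts["train"]) + sum(counts["test"])


def held_out_share(counts):
    """Fraction of all videos that belong to the test splits."""
    return sum(counts["test"]) / summed_total(counts)


for name, counts in DATASETS.items():
    status = "ok" if summed_total(counts) == counts["total"] else "MISMATCH"
    print(
        f"{name}: summed={summed_total(counts)} reported={counts['total']} "
        f"({status}), test share={held_out_share(counts):.1%}"
    )
```

The totals match for all three datasets. Syn-TA reserves 40% of its videos for testing, compared with roughly 7% for K400-TA and 13% for SSv2-TA.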

Model Performance

Realistic vs. Plain Background Performance

[Figure: Model performance visualization]
Model             Coarse motions           Fine motions
                  Realistic    Plain       Realistic    Plain
Unimodal Models
ResNet50          41.30        51.62       -            -
I3D               51.17        61.18       -            -
X3D               71.79        73.22       -            -
SlowFast          61.45        72.71       -            -
MViTv2            51.50        61.60       -            -
Rev-MViT          47.98        50.95       -            -
AIM               82.17        84.02       -            -
UniformerV2       67.25        75.81       -            -
Multimodal Models
ActionCLIP        70.27        78.21       53.85        56.48
X-CLIP            61.22        65.43       34.98        36.21
ViFi-CLIP         49.01        51.71       30.79        37.49
EZ-CLIP           68.38        77.17       38.71        51.01
FROSTER           46.91        59.52       33.26        44.87
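
The background effect can be summarized as the accuracy gap between plain and realistic backgrounds. The sketch below copies the coarse-motion columns from the table and ranks models by that gap; the variable and function names are illustrative.

```python
# (realistic, plain) coarse-motion accuracy, copied from the table above.
COARSE = {
    "ResNet50": (41.30, 51.62),
    "I3D": (51.17, 61.18),
    "X3D": (71.79, 73.22),
    "SlowFast": (61.45, 72.71),
    "MViTv2": (51.50, 61.60),
    "Rev-MViT": (47.98, 50.95),
    "AIM": (82.17, 84.02),
    "UniformerV2": (67.25, 75.81),
    "ActionCLIP": (70.27, 78.21),
    "X-CLIP": (61.22, 65.43),
    "ViFi-CLIP": (49.01, 51.71),
    "EZ-CLIP": (68.38, 77.17),
    "FROSTER": (46.91, 59.52),
}


def background_gap(scores):
    """Return plain-minus-realistic accuracy per model, rounded to 2 decimals."""
    return {model: round(plain - real, 2) for model, (real, plain) in scores.items()}


gaps = background_gap(COARSE)

# Largest gap first: these models lose the most when the background is realistic.
ranked = sorted(gaps.items(), key=lambda item: item[1], reverse=True)
for model, gap in ranked:
    print(f"{model:12s} {gap:+.2f}")
```

With these numbers every model scores higher on plain backgrounds, and the gap ranges from 1.43 points (X3D) to 12.61 points (FROSTER).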

[Figure: Proposed architecture for disentanglement]

BibTeX


@InProceedings{Abdullah_2025_ICCV,
    author    = {Abdullah, Raiyaan and Claypoole, Jared and Cogswell, Michael and Divakaran, Ajay and Rawat, Yogesh Singh},
    title     = {Punching Bag vs. Punching Person: Motion Transferability in Videos},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2025},
    pages     = {}
}