Recent Vision-Language Models (VLMs) such as ChatGPT, Gemini, Claude, Grok, Qwen, and InternVL have demonstrated remarkable capabilities in text and image analysis, but they still struggle to process and understand video, regardless of their scale. My research focuses on identifying failure cases, curating specialized datasets to evaluate these models, and proposing methods that improve their accuracy. Advancing video understanding is critical for achieving AGI.

Publications (Ph.D.)

Main Proceedings

Workshops

Awards

  • ORCGS Doctoral Fellowship, UCF (2023-2024)

Reviewer Experience

Direct Assignment

  • CVPR ’25

Part of CRCV

  • CVPR ’24
  • ICCV ’25
  • ICLR ’25
  • ICML ’24
  • NeurIPS ’24, ’25