Benchmarking Visual State Tracking in Multimodal Video Understanding
…VSTAT consists of 834 clips drawn from both synthetic and real-world videos, paired with 1,500 questions that cannot be answered from any single frame or short segment, requiring continuous perception…