Search: capability limitations

Paper page - X2SAM: Any Segmentation in Images and Videos

…Hao Wang, … Abstract: X2SAM is a unified multimodal model that extends segmentation capabilities from images to videos while supporting conversational instructions and visual prompts for both modalities. AI-generated summary: Multimodal Large…

May 6, 2026

Paper page - Ideology Prediction of German Political Texts

…To better understand the political rhetoric from various movements, ranging from left to right, we propose a transformer-based model capable of projecting the political orientation of a text on a continuous…

May 15, 2026

Paper page - WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments

…AI-generated summary: While GUI agents have shown impressive capabilities in common computer-use tasks such as OSWorld, current benchmarks mainly focus on isolated and single-application tasks. This overlooks a critical…

May 6, 2026

Paper page - FRAPPE: Full Input, Residual Output Autoencoding with Projection Pursuit Encoder

…Generated by Qwen/Qwen2.5-Coder-32B-Instruct. Media compression standards have reached a plateau in terms of the rate-distortion-complexity trade-off, limiting the ability to offload expensive AI perception…

Jun 1, 2026

Paper page - Multi-Stream LLMs: Unblocking Language Models with Parallel Streams of Thoughts, Inputs and Outputs

…AI-generated summary: The continued improvements in language model capability have unlocked their widespread use as drivers of autonomous agents, for example in coding or computer use applications. However, the core of…

May 13, 2026

Paper page - Covering Human Action Space for Computer Use: Data Synthesis and Benchmark

…Yet their reliability on complex, low-frequency interactions is still poor, limiting user trust. Our analysis of failure cases from advanced models suggests a long-tail pattern in GUI operations, where a…

May 14, 2026

Paper page - Ultralytics YOLO26: Unified Real-Time End-to-End Vision Models

…We present Ultralytics YOLO26, a unified real-time vision model family that addresses these limitations through coordinated architecture and training advances. YOLO26 uses a dual-head design for native NMS…

Jun 3, 2026

Paper page - MemLens: Benchmarking Multimodal Long-Term Memory in Large Vision-Language Models

…Xiyu Ren, …, Yiming Du, … Abstract: A new benchmark evaluates memory capabilities in vision-language models through multi-session conversations, revealing limitations of both long-context and memory-augmented approaches. AI-generated summary…

May 15, 2026

Paper page - Benchmarking Visual State Tracking in Multimodal Video Understanding

…tracking in videos, performing poorly even when human-level capabilities are required, and existing agentic approaches do not effectively address these limitations. Generated by Qwen/Qwen2.5-Coder-32B-Instruct. Understanding a…