WebRISE: Requirement-Induced State Evaluation for MLLM-Generated Web Artifacts
… Across 14 MLLMs, even the strongest model reaches only 65.6% transition validity and 66.3% requirement coverage, and visual quality is no proxy for behavior (Qwen3.6-35B-A3B on Markdown: V=80.8 yet T=15.5). …