Search: model releases

Paper page - WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

…These results show that long-horizon, native-runtime agent evaluation remains a far-from-resolved task for current frontier models. We release the tasks, code, and containerized tooling to support reproducible evaluation…

May 15, 2026

Paper page - WorldReasonBench: Human-Aligned Stress Testing of Video Generators as Future World-State Predictors

…model evaluation. Across modern video generators, our results expose a persistent gap between visual plausibility and world reasoning: videos can look convincing while failing dynamics, causality, or information preservation. We will release…

May 12, 2026

Paper page - EDU-CIRCUIT-HW: Evaluating Multimodal Large Language Models on Real-World University-Level STEM Student Handwritten Solutions

…Published on Apr 30. Submitted by Weiyu Sun on May 8. Authors: Weiyu Sun et al. Abstract: EDU-CIRCUIT-HW…

May 8, 2026

Welcome GPT OSS, the new open-source model family from OpenAI!

You arrive at OpenAI for ChatGPT press release · they greet you -- they gift you some merch a-wesome · i'd even say it's an oss-ome model How to spit on…

May 1, 2026 · Vaibhav Srivastav

Paper page - GLiNER-Relex: A Unified Framework for Joint Named Entity Recognition and Relation Extraction

…The model is released as an open-source Python package with a simple inference API that allows users to specify arbitrary entity and relation type labels at inference time and obtain both…

May 13, 2026

Paper page - AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation

…5% to 23.2%, and some models move by as many as five rank positions when ranked by quality score instead of pass rate. We release the anonymized project repository, including the…

May 14, 2026

Paper page - Nemotron 3 Nano Omni: Efficient and Open Multimodal Intelligence

…We are releasing model checkpoints in BF16, FP8, and FP4 formats, along with portions of the training data and codebase to facilitate further research and development.

May 1, 2026

Paper page - Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs

…On this subset, no model exceeds 50%, identifying refusal as a new optimization target that current models do not directly address. To prevent contamination, the dataset will be publicly released in late…

May 12, 2026

Paper page - When to Think, When to Speak: Learning Disclosure Policies for LLM Reasoning

…Jiaqi Wei, Qingyun Wang, et al. Abstract: Side-by-Side Interleaved Reasoning enables controlled disclosure timing in autoregressive models, improving accuracy and efficiency through interleaved private reasoning and delayed content release. (AI-generated summary)…

May 7, 2026

Paper page - Turning Drift into Constraint: Robust Reasoning Alignment in Non-Stationary Environments

…Extensive experiments on chest X-ray interpretation demonstrate that our 7B model achieves superior robustness, outperforming even proprietary source models in average accuracy. Furthermore, we release CXR-MAX, a large-scale benchmark…