Building The Imperfect Beast
… The jump in performance of Mythos over Opus 4.6 on Humanity’s Last Exam (perhaps not ironic, and not multiple choice, but including problems that models cannot solve and that domain experts, meaning people, struggle with but can solve) and on the CharXiv reasoning benchmark, which checks how models reason from chart… …