REPOT: Recoverable Program-of-Thought via Checkpoint Repair
… No fine-tuning, no rollout-time search. Results on PuzzleZoo-775: average gains of about +3 to +11 pp over vanilla Program-of-Thought across four closed-model configurations (gpt-5.4-mini ± reasoning, gemini-3.5-flash, claude-sonnet-4.6), peaking at 96.9% vs. 86.3% on gpt-5.4-mini-medium. …
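To make the "pp" (percentage point) unit concrete, the peak figures quoted above can be turned into an absolute gain and, as a separate quantity, a relative error reduction. This is only an illustrative sketch: the function names are hypothetical, the two accuracies are the only numbers taken from the text, and the relative error reduction is derived here rather than reported by the source.

```python
def pp_gain(new_acc: float, base_acc: float) -> float:
    """Absolute gain in percentage points (accuracies given in percent)."""
    return new_acc - base_acc


def relative_error_reduction(new_acc: float, base_acc: float) -> float:
    """Fraction of the baseline's errors that the new method removes."""
    base_err = 100.0 - base_acc
    new_err = 100.0 - new_acc
    return (base_err - new_err) / base_err


# Peak configuration quoted in the text: 96.9% vs. 86.3% on gpt-5.4-mini-medium.
gain = pp_gain(96.9, 86.3)
rer = relative_error_reduction(96.9, 86.3)

print(f"{gain:.1f} pp")   # 10.6 pp
print(f"{rer:.1%}")       # 77.4%
```

The 10.6 pp peak gain is consistent with the upper end of the "about +3 to +11 pp" range; the relative figure is larger because the baseline error rate (13.7%) is small.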