SoundnessBench: Can Your AI Scientist Really Tell Good Research Ideas from Bad Ones?
…under standard prompting, models frequently rate low-soundness proposals as sound, while aggressive prompting largely shifts errors from false positives to false negatives. Additional controls for public-corpus contamination, paper-identifying phrases…
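
To make the false-positive versus false-negative distinction concrete, here is a minimal sketch of how the two error rates could be computed from binary soundness judgments. The function name `error_rates`, the boolean label encoding, and the toy labels are all illustrative assumptions, not the benchmark's actual scoring code or data.

```python
def error_rates(gold_sound, predicted_sound):
    """Return (false_positive_rate, false_negative_rate).

    gold_sound[i] is True when proposal i is actually sound;
    predicted_sound[i] is True when the model rates it as sound.
    Both encodings are assumptions made for this illustration.
    """
    if len(gold_sound) != len(predicted_sound):
        raise ValueError("label lists must have the same length")

    pairs = list(zip(gold_sound, predicted_sound))
    # False positive: an unsound proposal rated as sound.
    false_pos = sum(1 for gold, pred in pairs if not gold and pred)
    # False negative: a sound proposal rated as unsound.
    false_neg = sum(1 for gold, pred in pairs if gold and not pred)

    n_unsound = sum(1 for gold in gold_sound if not gold)
    n_sound = len(gold_sound) - n_unsound

    fpr = false_pos / n_unsound if n_unsound else 0.0
    fnr = false_neg / n_sound if n_sound else 0.0
    return fpr, fnr


# Hypothetical toy labels, invented for illustration only.
gold = [True, True, False, False]
lenient = [True, True, True, False]    # accepts one unsound proposal
strict = [True, False, False, False]   # rejects one sound proposal

print(error_rates(gold, lenient))
print(error_rates(gold, strict))
```

In this toy setup the lenient rater makes only false-positive errors and the strict rater makes only false-negative errors, which is the kind of shift between prompting styles that the excerpt describes.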