Paper page - Evaluating Large Language Models in Dynamic Clinical Decision-Making with Standardized Patient Cases
…Applying MedSP1000 to a range of general-purpose and medically specialized LLMs, we find that performance on static benchmarks does not reliably translate to such educational scenarios. The best-performing model, GPT…