IntentGrasp: A Comprehensive Benchmark for Intent Understanding
…Notably, 17 out of 20 tested models perform worse than a random-guess baseline (15.2%) on Gem Set, while the estimated human performance is ~81.1%, showing substantial room for improvement…