SciEx: Benchmarking Large Language Models on Scientific Exams with Human Expert Grading and Automatic Grading

SciEx: Benchmarking Large Language Models on Scientific Exams with Human Expert Grading and Automatic Grading [100.3]
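A common use of Large Language Models (LLMs) is to perform tasks on scientific topics. In this paper, we therefore propose SciEx, a benchmark inspired by how university students are evaluated on such tasks. We use the new benchmark to evaluate the performance of state-of-the-art LLMs.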
Reference translation of the paper (metadata) (Fri, 14 Jun 2024 21:52:21 GMT)
A benchmark built from exams for university students. According to the paper, "SciEx is (1) multilingual, containing both English and German exams, and (2) multi-modal, containing questions that involve images, and (3) contains various types of freeform questions with different difficulty levels, due to the nature of university exams." Surprisingly (?), Claude Opus scores higher than GPT-4V.
The repository is tuanh23/SciEx · Datasets at Hugging Face
