A Survey on Large Language Model Benchmarks – Introducing the Latest arXiv Papers

A Survey on Large Language Model Benchmarks [45.0]
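General-capability benchmarks cover aspects such as core linguistics, knowledge, and reasoning. Domain-specific benchmarks focus on fields such as the natural sciences, the humanities and social sciences, and engineering technology. Target-specific benchmarks address risks, reliability, agents, and related concerns.
Reference translation of the paper (metadata) (Thu, 21 Aug 2025 08:43:35 GMT)
A survey of benchmarks. In the authors' words: "We systematically review the current status and development of large language model benchmarks for the first time, categorizing 283 representative benchmarks into three categories: general capabilities, domain-specific, and target-specific."
Many different benchmarks have been created to capture the behavior of LLMs broadly, so surveys like this one are very welcome.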
