The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality

The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality [70.5]
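The FACTS Leaderboard is an online leaderboard suite that comprehensively evaluates the ability of language models to generate factually accurate text. The suite provides a holistic measure of factuality by aggregating a model's performance across four distinct sub-leaderboards.
Reference translation of the paper (metadata) (Thu, 11 Dec 2025 16:35:14 GMT)
The paper describes the benchmark as follows: "The FACTS Leaderboard introduced here is designed to address this need by providing a holistic evaluation suite. It aggregates performance across four specialized sub-leaderboards, each targeting a distinct dimension of factuality."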
- FACTS Multimodal tests a model’s ability to combine visual grounding with world knowledge to answer questions about an image.
- FACTS Parametric measures the model’s ability to use its internal knowledge accurately in factoid question use-cases.
- FACTS Search evaluates the practical and increasingly common use case of generating factual responses by interacting with a search tool.
- FACTS Grounding v2 is an updated version of FACTS Grounding, which tests grounding to a given document, with improved judges.
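
As a rough illustration of what aggregating across the four sub-leaderboards could look like, here is a minimal Python sketch. It assumes the overall score is an unweighted mean of the four sub-scores; that rule, the function name `aggregate_facts_score`, the dictionary keys, and the numbers are all assumptions for illustration and are not taken from the paper or the Kaggle leaderboard.

```python
from statistics import mean

# Hypothetical keys for the four sub-leaderboards named in the list above.
SUB_LEADERBOARDS = ("multimodal", "parametric", "search", "grounding_v2")


def aggregate_facts_score(scores: dict[str, float]) -> float:
    """Combine per-sub-leaderboard scores into one number.

    Assumption: an unweighted mean. The actual leaderboard may
    weight or combine the sub-scores differently.
    """
    missing = [name for name in SUB_LEADERBOARDS if name not in scores]
    if missing:
        raise ValueError(f"missing sub-leaderboard scores: {missing}")
    return mean(scores[name] for name in SUB_LEADERBOARDS)


# Placeholder values only; these are NOT real leaderboard results.
placeholder_scores = {
    "multimodal": 40.0,
    "parametric": 60.0,
    "search": 80.0,
    "grounding_v2": 60.0,
}
print(aggregate_facts_score(placeholder_scores))
```

Requiring all four sub-scores before averaging keeps a model that skipped one sub-leaderboard from receiving an inflated or deflated overall number.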
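The project site is FACTS Benchmark Suite Leaderboard | Kaggle. Frontier models are strong, as expected, and the Search result for Gemini 3 Pro preview is particularly impressive. I would like to see results for the latest models.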
