FutureX: An Advanced Live Benchmark for LLM Agents in Future Prediction

FutureX: An Advanced Live Benchmark for LLM Agents in Future Prediction [84.4]
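FutureX is the largest and most diverse live benchmark for future prediction. It supports real-time daily updates and eliminates data contamination through an automated pipeline for question gathering and answer collection. The authors evaluated 25 LLM/agent models, including models with reasoning, search capabilities, and external tool integration.
Reference translation of the paper (metadata) (Sat, 16 Aug 2025 08:54:08 GMT)
A live benchmark for future prediction. The domain coverage is also broad: "we introduce FutureX, a dynamic and live evaluation benchmark specifically designed for LLM agents performing future prediction tasks. FutureX is built upon a semi-automated pipeline that continuously collects future-oriented questions from 195 diverse websites, curated from a pool of 2,008 sites covering areas such as politics, economics, technology, sports, healthcare, and more."
The overall result is that "LLM agents still lag behind humans", but it is interesting that some agents outperform humans on Level 2. (The way the levels are divided also feels slightly off to me...)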
- The Basic tier (Level 1) contains single-choice events with options fewer than 4.
- The Wide Search tier (Level 2) comprises multi-choice events with several correct answers.
- The Deep Search tier (Level 3) contains open-ended events whose underlying facts are relatively stable (with low volatility).
- The Super Agent tier (Level 4) covers high-volatility, open-ended events.
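
To make the level split above easier to picture, here is a minimal Python sketch of how an event might be mapped to a tier. It is only an illustration of the four quoted definitions, not the FutureX pipeline itself: the `Event` fields (`kind`, `num_options`, `high_volatility`) and the `assign_tier` function are hypothetical names, and the source does not say how volatility is measured, so it is reduced to a boolean flag here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Tier(Enum):
    BASIC = 1        # Level 1
    WIDE_SEARCH = 2  # Level 2
    DEEP_SEARCH = 3  # Level 3
    SUPER_AGENT = 4  # Level 4


@dataclass
class Event:
    # All fields are hypothetical; they are not taken from the paper.
    kind: str                      # "single_choice", "multi_choice", or "open_ended"
    num_options: int = 0           # only meaningful for choice events
    high_volatility: bool = False  # assumption: volatility reduced to a flag


def assign_tier(event: Event) -> Optional[Tier]:
    """Map an event to a tier following the four quoted definitions."""
    if event.kind == "single_choice" and event.num_options < 4:
        return Tier.BASIC
    if event.kind == "multi_choice":
        return Tier.WIDE_SEARCH
    if event.kind == "open_ended":
        return Tier.SUPER_AGENT if event.high_volatility else Tier.DEEP_SEARCH
    # The quoted definitions do not say where other events go
    # (for example, single-choice events with 4 or more options).
    return None
```

Note that the sketch returns `None` for single-choice events with four or more options, because the quoted definitions leave that case unspecified.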
