2025年10月28日 – arXiv最新論文の紹介

World-in-World: World Models in a Closed-Loop World

World-in-World: World Models in a Closed-Loop World [123.9]
我々は,実エージェントと環境の相互作用を反映したクローズドループの世界において,世界モデルをベンチマークする最初のオープンプラットフォームであるWorld-in-Worldを紹介した。多様なWMを厳格に評価し、タスク成功を主要な指標として優先順位付けし、視覚的品質に重点を置く4つのクローズドループ環境をキュレートする。 1)視覚的品質だけではタスクの成功は保証されないが、制御可能性の方が重要であること、2) 行動観測データによる後トレーニングのスケーリングは、事前訓練されたビデオジェネレータをアップグレードするよりも効果的であること、3) 推論時計算の割り当てにより、WMsは大幅にクローズドな改善が可能であること、の3つのサプライズを明らかにした。
論文参考訳（メタデータ） (Mon, 20 Oct 2025 22:09:15 GMT)
World model としてのViusual Generationモデルに対してのベンチマーク。VisualなクオリティとWorld modelとしてのクオリティにはギャップがあるとの指摘。
- We introduce World-in-World, the first comprehensive closed-loop benchmark that evaluates world models through the lens of embodied interaction, moving beyond the common focus on generation quality. • We propose a unified closed-loop planning strategy with a unified action API, allowing diverse world models to be seamlessly integrated and evaluated within a single framework across four embodied tasks.
- We introduce World-in-World, the first comprehensive closed-loop benchmark that evaluates world models through the lens of embodied interaction, moving beyond the common focus on generation quality.
- We propose a unified closed-loop planning strategy with a unified action API, allowing diverse world models to be seamlessly integrated and evaluated within a single framework across four embodied tasks.
- We discover that high visual quality does not necessarily guarantee task success, and demon- strate how the performance of pretrained video generators can be substantially improved through training-time data scaling and inference-time scaling.
プロジェクトサイトはWorld-in-World: World Models in a Closed-Loop World

LM Fight Arena: Benchmarking Large Multimodal Models via Game Competition

LM Fight Arena: Benchmarking Large Multimodal Models via Game Competition [104.8]
本稿では,Mortal Kombat IIにおける大規模マルチモーダルモデルを評価する新しいフレームワークであるLM Fight Arenaを紹介する。静的評価とは異なり、LM Fight Arenaは完全に自動化され、再現可能で、LMMの戦略的推論能力の客観的評価を提供する。
論文参考訳（メタデータ） (Fri, 10 Oct 2025 02:19:21 GMT)
「Unlike static evaluations, LM Fight Arena provides a fully automated, reproducible, and objective assessment of an LMM’s strategic reasoning capabilities in a dynamic setting. This work introduces a challenging and engaging benchmark that bridges the gap between AI evaluation and interactive entertainment.」とのことだが、なぜにMortal Kombat…
Claude 3.5 Sonnetがとても強いらしい。

LightMem: Lightweight and Efficient Memory-Augmented Generation

LightMem: Lightweight and Efficient Memory-Augmented Generation [72.2]
我々は、メモリシステムの性能と効率のバランスをとるLightMemという新しいメモリシステムを紹介した。人間の記憶のアトキンソン・シフリンモデルにインスパイアされたLightMemは、メモリを3つの相補的なステージにまとめる。 GPTとQwenのバックボーンを用いたLongMemEvalの実験では、LightMemは高いベースライン(最大10.9%のゲイン)を上回り、トークンの使用量を最大117倍に削減している。
論文参考訳（メタデータ） (Tue, 21 Oct 2025 17:58:17 GMT)
軽量かつ効率的なメモリーフレームワーク。「Inspired by the Atkinson–Shiffrin model of human memory, LightMem organizes memory into three complementary stages. First, cognition- inspired sensory memory rapidly filters irrelevant information through lightweight compression and groups information according to their topics. Next, topic-aware short-term memory consolidates these topic-based groups, organizing and summarizing content for more structured access. Finally, long-term memory with sleep- time update employs an offline procedure that decouples consolidation from online inference.」と3モジュール構成
リポジトリはGitHub – zjunlp/LightMem: LightMem: Lightweight and Efficient Memory-Augmented Generation

月	火	水	木	金	土	日
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30	31