arXiv – Page 20 – Introducing the latest arXiv papers

GameFactory: Creating New Games with Generative Interactive Videos

GameFactory: Creating New Games with Generative Interactive Videos [33.0]
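This paper introduces GameFactory, a framework that explores scene generalization in game video generation. We propose a multi-phase training strategy that decouples game style learning from action control while preserving open-domain generalization. We extend the framework to support autoregressive, action-controllable game video generation, which allows interactive game videos of unlimited length to be created.
Reference translation of the paper (metadata) (Tue, 14 Jan 2025 18:57:21 GMT)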
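A framework proposal: "By learning action control from a small-scale first-person Minecraft dataset, this framework can transfer these control abilities to open-domain videos, ultimately allowing the creation of new games within open-domain scenes." It is interesting that it can generate video that reflects controls such as movement. The fact that this control transfers suggests that the model already holds some of that knowledge internally, which is also intriguing.
The repository is GameFactory.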

Finding the Trigger: Causal Abductive Reasoning on Video Events

Finding the Trigger: Causal Abductive Reasoning on Video Events [59.2]
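Causal Abductive Reasoning on Video Events (CARVE) identifies causal relationships between events in a video. This paper proposes the Causal Event Relation Network (CERN), which examines the relationships between video events in the temporal and semantic spaces.
Reference translation of the paper (metadata) (Thu, 16 Jan 2025 05:39:28 GMT)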
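Proposes the task Causal Abductive Reasoning on Video Events (CARVE), which identifies the events in a video and their causal relationships and generates hypotheses about the causal chain that explains why a target event occurred. The paper also covers the data creation and proposes the Causal Event Relation Network (CERN) to solve the task.
An important task in practice, but it looks difficult.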

Lost in Translation, Found in Context: Sign Language Translation with Contextual Cues

Lost in Translation, Found in Context: Sign Language Translation with Contextual Cues [56.0]
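Our goal is to translate continuous sign language into spoken-language text. We incorporate the signing video together with additional contextual cues. We show that the contextual approach significantly improves translation quality.
Reference translation of the paper (metadata) (Thu, 16 Jan 2025 18:59:03 GMT)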
A proposal for a machine translation approach from sign language to text that makes use of contextual information, as in "(i) we propose a new LLM-based model that integrates visual signing and text features with contextual information, including video background descriptions and previous sentence translations;".
The repository is Lost in Translation, Found in Context.

LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs

LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs [103.0]
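We propose a comprehensive framework for advancing step-by-step visual reasoning in large language models. First, we introduce a visual reasoning benchmark designed specifically to evaluate multi-step reasoning tasks. Second, we propose a new metric that assesses visual reasoning quality at the granularity of individual steps. Third, we propose LlamaV-o1, a new multimodal visual reasoning model trained with a multi-step curriculum learning approach.
Reference translation of the paper (metadata) (Fri, 10 Jan 2025 18:59:51 GMT)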
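Proposes Visual Reasoning-Chain (VRCBench), a benchmark for multi-step visual reasoning tasks, and builds a model that strengthens Llama-3.2-11B-Vision-Instruct through curriculum learning. The model is available at omkarthawakar/LlamaV-o1 · Hugging Face.
It achieves performance close to that of commercial models.
The project site is LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs.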

Enabling Scalable Oversight via Self-Evolving Critic

Enabling Scalable Oversight via Self-Evolving Critic [59.9]
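SCRIT (Self-evolving CRITic) is a framework that enables genuine self-evolution of critique abilities. It self-improves by training on synthetic data generated through contrastive-based self-critique. Improvements of up to 10.3% are achieved.
Reference translation of the paper (metadata) (Fri, 10 Jan 2025 05:51:52 GMT)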
Proposes SCRIT (Self-evolving CRITic): "Technically, SCRIT self-improves by training on synthetic data, generated by a contrastive-based self-critic that uses reference solutions for step-by-step critique, and a self-validation mechanism that ensures critique quality through correction outcomes."
The improvement is reportedly confirmed with Qwen2.5-72B-Instruct as the base model.
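
The self-validation idea in the quote above (judging a critique by the outcome of its correction) can be illustrated with a minimal sketch. This is not the paper's implementation. The `Critique` record, its field names, and the exact-match comparison against a reference answer are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Critique:
    # All fields are hypothetical; the post does not describe SCRIT's data format.
    problem_id: str
    critique_text: str
    corrected_answer: str  # final answer reached after applying the critique's correction


def self_validate(critiques, reference_answers):
    """Keep only critiques whose correction reproduces the reference answer."""
    kept = []
    for critique in critiques:
        reference = reference_answers.get(critique.problem_id)
        if reference is not None and critique.corrected_answer.strip() == reference.strip():
            kept.append(critique)
    return kept


# Toy data: two candidate critiques for the same problem.
reference_answers = {"p1": "42"}
candidates = [
    Critique("p1", "Step 2 drops a factor of 2; redoing it gives 42.", "42"),
    Critique("p1", "Step 3 is wrong; the result should be 40.", "40"),
]
validated = self_validate(candidates, reference_answers)
```

Only the first candidate survives the filter. In a pipeline of this kind, the surviving critiques would become the synthetic training data.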

MiniMax-01: Scaling Foundation Models with Lightning Attention

MiniMax-01: Scaling Foundation Models with Lightning Attention [59.4]
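MiniMax-Text-01 and MiniMax-VL-01 offer superior capabilities for processing longer contexts. MiniMax-Text-01 can reach a context window of up to 1 million tokens during training and can extrapolate to 4 million tokens during inference at an affordable cost. Our vision-language model, MiniMax-VL-01, is built through continued training on 512 billion vision-language tokens.
Reference translation of the paper (metadata) (Tue, 14 Jan 2025 18:50:05 GMT)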
A large, publicly released MoE LLM with 456B parameters (32 experts, 45.9B active parameters). Its performance rivals commercial models such as GPT-4o, and the context length it can handle is very long at 4M tokens. The paper claims "We demonstrate the first successful large-scale implementation of linear attention." The architecture is a hybrid, as the paper also notes: "After extensive experimentation, we settled on a hybrid architecture mainly using lightning attention (Qin et al., 2024b), an I/O-aware implementation of a linear attention variant (Qin et al., 2022a)."
The repository is GitHub – MiniMax-AI/MiniMax-01.
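
To give a rough sense of why linear attention helps with long contexts, here is a minimal sketch of generic causal linear attention in plain Python. It is not lightning attention and not MiniMax's implementation. The `elu(x) + 1` feature map and all function names are assumptions chosen for illustration. The point is that a running state replaces the growing list of past keys and values, so the cost per token stays constant.

```python
import math


def feature_map(vector):
    # elu(x) + 1 keeps every feature positive (an assumed choice of kernel).
    return [x + 1.0 if x > 0 else math.exp(x) for x in vector]


def causal_linear_attention(queries, keys, values):
    """Process tokens one by one, keeping only running sums as state."""
    d, dv = len(keys[0]), len(values[0])
    state = [[0.0] * dv for _ in range(d)]  # sum over j of phi(k_j) v_j^T
    norm = [0.0] * d                        # sum over j of phi(k_j)
    outputs = []
    for q, k, v in zip(queries, keys, values):
        phi_k = feature_map(k)
        for a in range(d):
            norm[a] += phi_k[a]
            for b in range(dv):
                state[a][b] += phi_k[a] * v[b]
        phi_q = feature_map(q)
        denom = sum(phi_q[a] * norm[a] for a in range(d))
        outputs.append(
            [sum(phi_q[a] * state[a][b] for a in range(d)) / denom for b in range(dv)]
        )
    return outputs


def causal_quadratic_reference(queries, keys, values):
    """Same result computed the quadratic way, revisiting all past tokens."""
    dv = len(values[0])
    outputs = []
    for i, q in enumerate(queries):
        phi_q = feature_map(q)
        weights = [
            sum(x * y for x, y in zip(phi_q, feature_map(keys[j]))) for j in range(i + 1)
        ]
        total = sum(weights)
        outputs.append(
            [sum(weights[j] * values[j][b] for j in range(i + 1)) / total for b in range(dv)]
        )
    return outputs


queries = [[0.1, -0.2], [0.5, 0.3], [-1.0, 0.8]]
keys = [[0.4, 0.0], [-0.3, 0.9], [0.2, -0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
linear_out = causal_linear_attention(queries, keys, values)
reference_out = causal_quadratic_reference(queries, keys, values)
```

Both functions produce the same outputs up to floating-point error, but the first never looks back at earlier tokens. That property is what makes very long contexts tractable in principle.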

Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains

Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains [114.8]
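Large language models (LLMs) have achieved remarkable performance in recent years but are fundamentally limited by their underlying training data. This paper proposes a complementary approach to self-improvement in which fine-tuning is applied to a multiagent society of language models.
Reference translation of the paper (metadata) (Fri, 10 Jan 2025 04:35:46 GMT)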
A proposal for a self-improvement approach: "Instead of fine-tuning a single model, our method finetunes a multiagent set of language models from the same base model and then independently specializes each model to capture parts of a task of interest." The generation models and critic models are tuned together, and their outputs are integrated through multiagent debate. The critic model also appears to be quite important.
The repository is Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains.

WebWalker: Benchmarking LLMs in Web Traversal

WebWalker: Benchmarking LLMs in Web Traversal [55.4]
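WebWalkerQA is a benchmark for evaluating the ability of LLMs to perform web traversal. This paper proposes WebWalker, a multi-agent framework that mimics human-like web navigation through an explore-critic paradigm.
Reference translation of the paper (metadata) (Mon, 13 Jan 2025 18:58:07 GMT)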
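Proposes WebWalkerQA, a benchmark that tests whether a model can retrieve the information it needs while navigating a website ("It evaluates the capacity of LLMs to traverse a website’s subpages to extract high-quality data systematically."), together with WebWalker, a multi-agent framework for solving it. The dataset requires agentic behavior and remains hard to solve even with frontier models such as GPT-4o. (Somewhat surprising.)
The project site is WebWalker, the repository is GitHub – Alibaba-NLP/WebWalker: 🌐 WebWaker: Benchmarking LLMs in Web Traversal, and there is also WebWalkerQALeaderboard – a Hugging Face Space by callanwu.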

What Limits LLM-based Human Simulation: LLMs or Our Design?

What Limits LLM-based Human Simulation: LLMs or Our Design? [43.5]
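We argue that advancing LLM-based human simulation requires addressing both the inherent limitations of LLMs and the design challenges of simulation frameworks. To support further research in this field, we provide a curated collection of LLM-based human simulation resources.
Reference translation of the paper (metadata) (Wed, 15 Jan 2025 04:59:49 GMT)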
An analysis and organization of the challenges in "LLM-based human simulation". I agree with the statement that "Compared to tasks in NLP or CV, LLM-based human simulations present a much greater complexity".
The repository is GitHub – Persdre/llm-human-simulation: Collection of papers related to llm human simulation.

The Tabular Foundation Model TabPFN Outperforms Specialized Time Series Forecasting Models Based on Simple Features

The Tabular Foundation Model TabPFN Outperforms Specialized Time Series Forecasting Models Based on Simple Features [40.2]
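This paper proposes TabPFN-TS, a simple approach that combines TabPFN with simple feature engineering to achieve strong forecasting performance. Despite its simplicity and only 11 million parameters, TabPFN-TS outperforms Chronos-Mini, a model of similar size, and slightly outperforms Chronos-Large, which has 65 times more parameters.
Reference translation of the paper (metadata) (Mon, 06 Jan 2025 11:38:19 GMT)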
A proposal built on a tabular foundation model, an area that seems fairly hard to get right. The authors state: "By using a simple set of timestamp-derived features, our approach matches or slightly outperforms Chronos-T5 (Large), which, to our knowledge, is one of the strongest time series foundation models." The model may be capturing the basic dynamics of time series data, but anyone using it should probably still validate it on their own domain.
The repository is GitHub – PriorLabs/tabpfn-client: ⚡ Easy API access to the tabular foundation model TabPFN ⚡.
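
As an illustration of what "timestamp-derived features" can look like, here is a minimal sketch that turns a series of timestamps into tabular rows. The specific features (a running index plus sine/cosine encodings of hour and day of week) and the function names are assumptions. This post does not list the exact feature set used by TabPFN-TS, and no TabPFN API is called here.

```python
import math
from datetime import datetime, timedelta


def timestamp_features(ts, index):
    """Build one feature row from a timestamp (assumed, illustrative features)."""
    hour_angle = 2 * math.pi * ts.hour / 24
    dow_angle = 2 * math.pi * ts.weekday() / 7
    return {
        "running_index": index,            # position in the series, lets a model see trend
        "hour_sin": math.sin(hour_angle),  # cyclic encoding of the hour of day
        "hour_cos": math.cos(hour_angle),
        "dow_sin": math.sin(dow_angle),    # cyclic encoding of the day of week
        "dow_cos": math.cos(dow_angle),
        "month": ts.month,
    }


def build_table(start, steps, step=timedelta(hours=1)):
    """Create one feature row per time step, starting at `start`."""
    return [timestamp_features(start + i * step, i) for i in range(steps)]


# 48 hourly rows starting on Monday, 6 January 2025.
rows = build_table(datetime(2025, 1, 6, 0, 0), 48)
```

With a table like this, forecasting becomes ordinary tabular regression. Rows for past timestamps are paired with observed values as training data, and rows for future timestamps are the inputs to predict.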
