Strefer: Empowering Video LLMs with Space-Time Referring and Reasoning via Synthetic Instruction Data

Strefer: Empowering Video LLMs with Space-Time Referring and Reasoning via Synthetic Instruction Data [100.5]
Streferはビデオ大モデルに参照と推論機能を持たせるために設計された合成データ生成フレームワークである。 Streferは、時間的に密度が高くきめ細かなビデオメタデータを擬似アノテーションするデータエンジンを使用して、多様な命令生成データを生成する。我々のアプローチは、ビデオLLMが空間的および時間的参照を解釈する能力を高め、現実のAIコンパニオンに不可欠な、より汎用的で時空間対応の推論を育む。
論文参考訳（メタデータ） (Wed, 03 Sep 2025 17:33:20 GMT)
「Our approach begins with a modular framework that orchestrates multiple agents—including pretrained Large Language Models (LLMs), Video LLMs, and Pixel-Level Multimodal Vision Foundation Models (e g , RexSeek [20], GroundingDINO [32] and SAM2 [44])—to pseudo-annotate video metadata with temporally dense and object-centric space-time information. This metadata captures detailed spatial and temporal structures, such as subjects, objects, their locations as masklets (segmentation masks tracked over time), and action timelines. Building on this structured metadata, we leverage in-context learning and well-defined task schemas to guide LLMs in generating high-utility instruction data for tuning Video LLMs.」と凝った構成による動画に対する合成データフレームワークの提案。
プロジェクトサイトはStrefer: Data Engine for Video LLMs

コメントを残す

月	火	水	木	金	土	日
1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28
29	30	31

コメントを残す コメントをキャンセル

コメントを残すコメントをキャンセル