2024年8月28日 – arXiv最新論文の紹介

mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models

mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models [71.4]
本稿では,多目的なマルチモーダル大言語モデルであるmPLUG-Owl3を提案する。具体的には、視覚と言語を共通の言語誘導意味空間に効率的に統合する新しいハイパーアテンションブロックを提案する。
論文参考訳（メタデータ） (Tue, 13 Aug 2024 08:10:32 GMT)
mPLUG-Owlのver 3
リポジトリはGitHub – X-PLUG/mPLUG-Owl: mPLUG-Owl: The Powerful Multi-modal Large Language Model Family

JPEG-LM: LLMs as Image Generators with Canonical Codec Representations [51.1]
離散化は、画像やビデオのような連続したデータを離散トークンとして表現する。画像やビデオを識別する一般的な方法は、生のピクセル値のモデリングである。本研究では,画像やビデオを直接,標準コーデック(JPEG,AVC/H.264)を介してコンピュータ上に保存した圧縮ファイルとしてモデル化することを提案する。
論文参考訳（メタデータ） (Wed, 21 Aug 2024 00:24:53 GMT)
JPEGを直接扱えるL(?)Mの提案。「For generality, our models also do not use any vision-specific modules like convolutions or 2D positional embeddings, potentially making the task more challenging.」、「However, we observe that conventional, vanilla language modeling surprisingly conquers these challenges without special designs as training goes (e g , JPEG-LM generates realistic images barely with any corrupted JPEG patches).」とのこと。アーキテクチャは7B Llama-2 model、本当に強力。

Visual Agents as Fast and Slow Thinkers [88.7]
本稿では、Fast and Slow Thinking機構を視覚エージェントに組み込んだFaSTを紹介する。 FaSTは、システム1/2モードを動的に選択するためにスイッチアダプタを使用し、異なるタスクの複雑さに対する問題解決アプローチを調整している。モデルの信頼性を調整し、新しいコンテキストデータを統合することで、不確実で目に見えないオブジェクトに取り組む。
論文参考訳（メタデータ） (Fri, 16 Aug 2024 17:44:02 GMT)
かの有名なFast and SlowをMLLMエージェントに適用。「the concepts of System 1 (fast, intuitive) and System 2 (slow, deliberate) thinking into visual agents, aiming to enhance their reasoning and decision-making capabilities.」というコンセプト
効果があったとするが公平な比較になっているんだろうかという疑問がなくはない。