MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct

MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct [148.4]
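We propose MMEvol, a novel framework for curating image-text instruction data. MMEvol combines fine-grained perception evolution, cognitive reasoning evolution, and interaction evolution. The proposed method achieves an average accuracy improvement of 3.1 points and reaches state-of-the-art (SOTA) performance on 9 of 13 vision-language tasks.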
Reference translation of the paper (metadata) (Mon, 9 Sep 2024 17:44:00 GMT)
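The paper describes MMEvol as "a novel multimodal instruction data evolution framework that combines fine-grained perception evolution, cognitive reasoning evolution, and interaction evolution." Its distinguishing feature is that it is multimodal. On effectiveness, the authors state: "The data evolved through three rounds of evolution is used to train a new model, demonstrating state-of-the-art (SOTA) performance across a comprehensive set of benchmarks."
It is interesting that the approach has been shown to work in multimodal contexts, beyond text and mathematical problems. It is also notable that the future work section mentions integration with image generation models.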
The project site is MMEvol: Welcome (rainbowluocs.github.io)
