Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos [110.3]
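Sa2VA is a unified model for grounded understanding of both images and videos. It supports a wide range of image and video tasks, including segmentation and conversation. The paper shows that Sa2VA achieves state-of-the-art results on multiple tasks, especially video object segmentation.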
Reference translation of the paper (metadata) (Tue, 07 Jan 2025 18:58:54 GMT)
The authors write: "By leveraging the knowledge from both LLaVA and SAM-2, our model has strong capabilities in both mask and language generation." That makes sense.
The repository is Sa2VA.
